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ABSTRACT 


In  the  four  principal  activities  of  data  base  design, 

1)  Identifiers  and  descriptors  must  be  grouped  together 
to  form  logical  records.  A suggested  standard  form  and  pro- 
cedure for  obtaining  this  form  are  described. 

2)  Relationships  among  logical  records  must  be  estab- 
lished by  logical  access  paths.  The  hierarchical,  CODASYL 
DBTG,  and  relational  models  and  the  situations  In  which  they 
are  applicable  are  described. 

3)  The  physical  form  of  records  must  be  determined. 
Several  techniques  for  reducing  storage  requirements  are 
described. 

4)  The  physical  realizations  of  logical  access  paths 
must  be  determined.  A general  technique  Is  described, 
examples  are  presented  and  evaluated,  and  specific  sug- 
gestions are  made  for  many  different  types  of  applications. 
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INTRODUCTION 


OBJECTIVES  AND  ORGANIZATION 


This  report  Is  a very  brief  survey  of  data  base  design.  The 


primary  objectives  are  to  present  various  alternatives  in  the  logical 


and  physical  phases  of  the  design  process,  and  to  discuss  the  Implications 


of  those  alternatives.  The  design  process  is  divided  into  four  phases 


described  in  corresponding  sections 


1)  the  design  of  logical  records 


2)  the  design  of  logical  access  paths 


3)  the  design  of  physical  records,  and 


4)  the  design  of  physical  access  paths 


These  terms  will  be  defined  briefly 


A "logical  record"  is  a collection  of  "fields"  which  is  to  be  mani- 


pulated, as  an  entity,  by  a user.  Each  field  has,  at  any  particular 


time,  a specific  value  which  may  serve  to  either  identify  (e.g.,  a social 


security  number)  or  describe  (e.g 


a home  address)  a real-world  object 


(e.g.,  an  employee)  being  represented  in  a data  base.  The  value  of  a 


field  may  be  a member  of  a "simple"  domain  (i.e 


set  of  quantities 


which  cannot  be  logically  subdivided)  or  a "complex"  domain  (e.g.,  the 


value  of  the  field  CHILDREN  could  be  a set  of  children's  names) 


A "logical  access  path"  is  a sequence  of  logical  records  which  can 


be  followed  to  get  from  one  object  (e.g.,  an  employee)  to  another,  related 


object  (e.g.,  the  department  in  which  he  or  she  works).  The  term  "data 


structure"  is  commonly  used,  but  has  the  unfortunate  connotation  that  the 


sequence  of  logical  records  is  inherent  in  the  data,  (kmplex  access  paths 
can,  however,  be  the  product  of  complex  algorithms  and  simple  date. 
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A "physical  record"  is  a collection  of  physically  contiguous  fields 
from  which  some  of  the  fields  of  one  or  more  logical  records  can  be 
determined.  A frequently  used  part  of  a logical  record  could  be  on  a 
faster  and  more  costly  peripheral  device  than  an  Infrequently  used  part. 
A logical  record  could  contain  an  employee's  age,  although  the  physical 
record  could  contain  his  or  her  birth  date. 

A "physical  access  path"  Is  a sequence  of  physical  records  corres- 
ponding to  a logical  access  path.  The  terra  "storage  structure"  is 
commonly  used,  but  has  the  connotation  of  being  Independent  of 
algorithm  or  process. 

GENERAL  REFERENCES 

James  Martin's  book,  "Computer  Data-Base  Organization"  (Prentice- 
Hall,  Englewood  Cliffs,  New  Jersey,  1975)  Is  an  excellent  source  of  both 
general  and  specific  Information  on  all  aspects  of  data  base  design. 
Numerous  diagrams,  examples,  and  summaries  make  the  book  exceptionally 
easy  to  use  as  a reference;  numerous  typographical  errors  will  hopefully 
be  corrected  In  later  editions.  Donald  E.  Knuth's  series  on  "The  Art  of 
Computer  Programming",  particularly  "Volume  3:  Sorting  and  Searching" 
(Addison-Wesley,  Reading,  Massachusetts,  1973)  Is  oriented  primarily 
toward  the  design  of  physical  access  paths;  In  this  area  It  Is  unequaled 
as  a collection  of  techniques  and  analyses  of  techniques. 

ABBREVIATIONS 

Only  standard  abbreviations  are  used  In  this  report,  except  for 
ACM  (the  Association  for  Computing  Machinery)  and  CACM  (the  Communi- 
cations of  the  ACM). 
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NOTATION 


DESIGN  OF  LOGICAL  RECORDS 


The  notation  of  n-ary  relatione  will  be  used  to  describe  logical 
records.^  For  exanple, 

PART  (PARTI,  PART-DESCRIPTION) 

represents  a type  of  record  called  PART  which  contains  fields  called 
PART#  and  PART-DESCRIPTION  plus  possibly  other  unspecified  fields.* 

Fields  must  have  simple  domains;  repeating  groups  are  not  allowed.  One 
or  more  fields  will  be  underlined;  each  continuous  underline  Indicates  a 
"candidate  key."  By  definition  a candidate  key  Is  a collection  of  fields 
which  uniquely  Identifies  a record  occurrence  (i.e.,  the  candidate  key 
has  a different  value  in  each  record  occurrence)  and  which  has  no  proper 
subcollection  which  is  a candidate  key  (I.e.,  all  fields  are  necessary 
for  unique  identification  of  a record  occurrence) . One  of  the  candidate 
keys  Is  designated  as  the  "primary  key"  (i.e.,  the  primary  Identifier 
of  the  record) . Every  record  must  have  a value  for  each  field  in  the 
primary  key;  other  fields  may  be  null,  to  indicate  that  the  value  is 
either  unknown  or  Irrelevant,  but  It  makes  very  little  sense  to  have  a 
record  describing  an  unidentified  object.  The  term  candidate  key  refers 
to  the  fact  that  the  key,  or  collection  of  fields.  Is  a candidate  to  be  a 
primary  key.  For  example, 

EMPLOYEE  (ID#,  SSI,  NAME) 

Indicates  that  an  employee  record  Is  uniquely  Identified  by  either  a 

1.  Codd,  E.  F.,  "Normalized  Data  Base  Structure:  A Brief  Tutorial," 

Proc.  1971  ACM-SIGFIDET  Workshop  on  Data  Description,  Access  and  Control, 
ACM,  New  York,  pp.  1-17  (1971). 

*The  terms  "relation,”  "tuple”  and  ”attrlbute”are  generally  used  In 
relational  literature  rather  than  "record  type,”  "record  occurrence,”  and 
"field”  respectively,  as  here.  The  latter  terms  are  used  because  of 
their  greater  familiarity  to  most  readers. 
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social  security  nunber  or  by  an  Identification  number,  but  not  neces- 

» 

sarlly  by  a name.  Either  ID#  or  SS#  could  be  the  primary  key.  However, 

PART-SUPPLIER  (Pi,  Si.  QUANTITY) 

Indicates  that  a PART-SUPPLIER  record  Is  uniquely  Identified  only  by 
the  combination  of  part  number  and  supplier. 

A common  domain  In  two  record  types  establishes  a logical  associa- 
tion or  access  path  between  them.*  For  example,  the  following  record 
type 

SKILL  (ID#,  SKILL-CODE.  RATING) 

Is  associated  with  the  EMPLOYEE  record  type  through  the  ID#.  If  the  data 
base  contains  the  following  records: 

EMPLOYEE  (ID#^,  S^,  NAME) 

001  123456789  JONES 

002  111223333  SMITH 

SKILL  (ID#,  SKILL-CODE.  RATING) 

< 

001  1 10 

001  8 5 

002  1 6 

002  3 10 

002  4 10 

2 

then  the  following  Information  can  be  obtained  by  "Joining"  EMPLOYEE  and 
SKILL  records  with  Identical  ID#  fields: 

*That  Is,  there  Is  a field  In  one  record  type  which  can  be  compared 
%rlth  a field  In  the  other  record  type.  Conversion  of  units  Is  permissible. 
For  example,  a %>elght  field  In  pounds  can  be  compared  with  a weight  field 
In  kilograms,  but  not  with  a part  number. 

2.  Codd,  E.  F.,  "Relational  Completeness  of  Data  Base  Sublanguages," 
Proc.  Courant  Computer  Science  Symposium  6:  Data  Base  Systems,  Prentice- 

Hall,  Englewood  Cliffs,  New  Jersey,  pp.  72-74  (1972). 
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EMP-SKILL  (SS#, 

SKILL-CODE. 

IDI, 

NAME, 

RATING) 

123456789 

1 

001 

JONES 

10 

123456789 

8 

001 

JONES 

5 

111223333 

1 

002 

SMITH 

6 

111223333 

3 

002 

SMITH 

10 

111223333 

4 

002 

SMITH 

10 

and  then  "projecting"  (Codd,  1972, 

pp.  71-72)^ 

onto 

the  SS# 

and  SKII. 

CODE  fields  of  the  result: 

SS#-SKILL  (SS#j SKILL-CODE) 

123456789  1 

123456789  8 

111223333  1 

111223333  3 

111223333  4 

If  the  records  were  joined  as  above  and  the  result  projected  onto  the  SSit 
and  RATING  fields,  the  result  would  be: 

SS#-RATING  (SSO,  RATING) 

123456789  10 

123456789  5 

111223333  6 

111223333  10 

The  duplicate  record  is  removed.  A field,  such  as  ID#  in  SKILL,  which  is 
a candidate  key  in  another  record  type,  such  as  EMPLOYEE,  is  called  a 
"foreign  key"  to  that  latter  record  type. 

This  notation  is  used  because  it  clearly  indicates  the  logical 
identifiers  and  content  of  the  record,  and  does  not  constrain  the  logical 
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associations  or  physical  Inplementatlon  of  the  data  base.  For  example, 
the  SKILL  record  type  could  be  physically  realized  as  a pair  of  fields 
(SKILL~CODE  and  RATING)  that  could  be  a repeating  group  within  EMPLOYEE, 
or  a record  type  under  EMPLOYEE  In  a hierarchy,  or  a member  of  a set 
owned  by  EMPLOYEE.*  If  the  SKILL  record  type  were  logically  represented 
as  a repeating  group  or  as  a part  of  a hierarchy.  It  would  not  be  logi- 
cally accessible  from  another  record  with  a common  domain.  If  It  were 
represented  as  a member  of  a set,  then  there  would  be  no  clear  Indication 
that  ID#  Is  an  essential  part  of  SKILL. 

A primary  key,  whether  a single  field  or  multiple  fields.  Identifies 
a real  world  object  (e.g.,  a physical  object  or  collection  of  physical 
objects,  a concept,  an  association,  or  an  event)  which  Is  being  represen- 
ted by  a record  In  the  data  base,  while  the  other  fields  In  the  record 
describe  that  object.  A non-prlmary,  candidate  key  Is  an  alternate 
Identifier.  Primary  keys  generally  behave  differently  from  other  fields 
during  update  operations:  primary  keys  are  added  or  removed  (by  Insertion 

and  deletion  operations  on  the  records),  whereas  other  fields  are  modified. 
Other  fields,  since  they  represent  Information  about  the  primary  key,  are 
frequently  much  more  volatile  than  the  primary  key  Itself.  For  example, 

PART  (P^,  COST,  QOH) 

represents  the  cost  and  quantity  on  hand  of  a given  part,  both  of  which 
would  normally  vary  much  more  frequently  than  the  part  number.  Note  that 
changing  the  part  number  may  be  a complex  operation,  since  P#  must  be 
changed  not  only  In  the  PART  record,  but  In  each  associated  PART-SUPPLIER 
record . 

*See  the  subsection  on  "CODASYL  DBTG  Owner-Coupled  Sets”  for  an 
explanation  of  the  final  alternative. 
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The  following  eubeection  will  present  the  concept  of  "functional 
dependence."  The  remainder  of  the  aection  will  consist  of  applications 
of  this  concept  to  the  design  of  logical  recorda.  Simple  rules,  based  on 
functional  dependence,  tend  to  aimplify  data  base  description,  access, 
and  maintenance,  and  hence  provide  an  important  tool  for  data  base 
design. 

FUNCTIONAL  DEPENDENCE 

A field  A is  said  to  be  "functionally  dependent"  on  a collection 
of  fields  C if,  at  any  particular  time,  the  value  of  A is  uniquely 
determined  by  C.  This  is  written  C — ♦ A.  For  example,  in  SUPPLIER-PART 
( S# , P# , QUANTITY),  QUANTITY  is  functionally  dependent  on  the  primary  key 
S# , P# , but  not  on  either  S#  or  P#  individually.  (There  may  be  only  one 
supplier  for  a particular  part,  but  functional  dependence  concerns  only 
those  constraints  which  must  hold  for  the  entire  data  base  at  all  times.) 

FIRST  NORMAL  FORM 

Two  sources  of  complexities  in  accessing  and  updating  a data  base 

are  fields  with  non-simple  domains  (e.g.,  repeating  groups),  and  irregular 

dependencies  among  fields.  This  subsection  considers  the  elimination  of 

non-simple  domains;  the  following  two  subsections  consider  a two-step 

process  for  eliminating  irregular  dependencies. 

A file  is  "flat"  if  all  domains  are  simple  and  all  records  consist 

of  the  same  fields.  Such  a file  can  be  represented  by  a two-dimensional 

tabulation  of  field  values,  and  is  in  "first  normal  form"  (Codd,  1971, 

1 

pp.  11-13).  The  primary  advantages  of  such  a file  are  that  its  logical 
structure  can  be  easily  described  and  understood  and  that  it  can  be 


manipulated  by  operations  which  are  intuitively  and  computationally 

2 

both  simple  and  powerful  (Codd,  1972,  pp.  65-98).  The  uniformity  of 
structure  encourages  the  user  to  organise  his  queries  or  programs  into 
simple  modules  corresponding  to  the  elementary  objects  in  his  model  of 
the  real  world.  Furthermore,  evolution  of  the  data  base  is  greatly 
simplified  by  modularity  and  simplicity  of  structure;  binary  relations 
may  be  of  great  advantage  in  a rapidly  evolving  environment  since  new 
fields  may  be  easily  added  to  the  data  base.* 

Elimination  of  repeating  groups  from  a record  structure  has  already 
been  illustrated  by  the  example  of  the  EMPLOYEE  and  SKILL  records.  In 
that  case,  the  employee's  SKILL  records  were  associated  with  his  EMPLOYEE 
record  by  means  of  a foreign  key,  the  identification  number.  The  physical 
representation  of  the  EMPLOYEE  record  might,  however,  contain  SKILL  as  a 
repeating  group.  As  another  example,  suppose  that  a department  record 
contains  the  department  number  and  a repeating  group  of  job  numbers.  In 
first  normal  form  this  could  be: 

DEPT  (DEPT#,  MANAGER#) . 

DEPT-JOB  (DEPT#, JOB#,  »JOB-DESC)  . 

Given  a particular  department,  its  jobs  are  represented  in  those  DEPT-JOB 
records  with  the  appropriate  value  in  the  foreign  key,  DEPT#.  Note  that 
any  n-level  hierarchical  structure  can  be  represented  by  n different 
record  types,  irhere  level  K can  be  represented  by  a record  type  whose 
primary  key  is  the  concatenation  of  the  primary  key  of  level  K-1  and 
additional  fields  identifying  a specific  record  at  level  K.  For  example, 
to  continue  the  DEPT  and  DEPT-JOB  hierarchy. 


*See  the  subsection  on  "Binary  Relations”. 
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DEPT-JOB-EMP  (DEPT#.  JOB#,  ID#.  EMP-DESC) . 

The  proliferation  of  fields  In  the  prlaary  key  does  not  laply  a long  and 
space-conaiaiing  key  In  the  physical  record,  since  this  is  only  a logical 
representation.  The  fields  in  the  primary  key  clearly  describe  the 
hierarchical  structure  without  arrows  or  other  graphic  devices,  and 
therefore  are  convenient  for  computer  documentation  of  the  data  base 
structure. 

SECOND  NORMAL  FORM 

Transforming  a file  into  first  normal  form  eliminates  processing 
complexities  due  to  non-simple  domains.  Other  complexities  may  be  caused 
by  Irregular  dependencies  among  fields.  This  subsection  discusses  the 
first  step  in  eliminating  those  irregular  dependencies;  the  following 
subsection  discusses  the  second  step. 

The  following  example  illustrates  one  type  of  irregular  dependency 
and  its  effects: 

FURNITURE  (NAME,  COLOR.  PRICE,  STOCK  #). 

Both  name  and  color  are  required  to  retrieve  a stock  number  for  furniture, 
but  price  is  functionally  dependent  on  name  alone.  At  a given  time  the 
file  might  consist  of: 


FURNITURE  (NAME, 

COLOR. 

PRICE, 

STOCK#) 

CHAIR 

BROWN 

$50 

0001 

CHAIR 

GRAY 

$50 

0002 

DESK 

BLACK 

$200 

0003 

DESK 

RED 

$200 

0004 

This  file  has  the  following  undesirable  properties:  1)  changing  the 

price  of  an  item  involves  changing  a separate  logical  record  for  each 
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color  of  the  Item,  2)  a new  line  of  furniture  (e.g.,  a cabinet)  cannot  be 
added  to  the  file  unless  a color  Is  known  or  a null  Is  specified  for  the 
color  (In  which  case  the  rr  -ord  must  be  replaced  when  a color  Is  known), 
and  3)  deleting  a color  will  or  will  not  delete  name  and  price,  depending 
on  whether  that  Is  the  only  color  available.  Codd  (1971,  p.l3)^  refers 
to  these  undesirable  properties  as  update.  Insertion,  and  deletion  ano- 
malies, respectively;  their  effect  Is  to  Introduce  opportunities  for 
error,  since  program  designers  must  anticipate  all  of  these  exceptional 
conditions.  These  problems  arise  because  price  is  functionally  dependent 
on  only  part  of  the  key. 

The  solution  Is  to  require  that  each  field  which  Is  not  part  of  a 
candidate  key  must  be  determined  only  by  the  entire  candidate  key. 

Records  satisfying  this  condition  are  in  "second  normal  form"  (Codd, 

1971,  pp.  13-14).^  Any  record  In  first  normal  form  may  be  subdivided 
into  smaller  records  which  are  In  second  normal  form.  For  the  example, 
the  result  will  be; 


: t 


I 


and 


FURN-PRICE  (NAME, 
CHAIR 
DESK 


PRICE) 

$50 

$200 


• i 

» 

1 

FURN-COLOR  (NAME, 

COLOR, 

STOCK#) 

j 

! 

CHAIR 

BROWN 

0001 

! 

CHAIR 

GRAY 

0002 

1 

1 

1 

DESK 

BLACK 

0003 

< 

i . 

DESK 

RED 

0004 

The  technique  Is  Identical  to  that  used  In  eliminating  repeating  groups, 
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Not*  that  th*  anoaall**  have  b**n  raaoved;  a prlc*  changa  involvaa 


only  ona  racord,  and 


naaa  and  prlc*  can  axiat  Indapandantly  of  color. 


; I 

V- 

' ■.! 


THIRD  NORMAL  FORM 

Another  type  of  irregularity  la  llluatrated  by  the  following  example: 


ITEM  (ITBItf. 

NAME, 

PRICE) 

001 

CHAIR 

$50 

002 

CHAIR 

$50 

003 

CHAIR 

$50 

004 

DESK 

$200 

003 

DESK 

$200 

The  Item  ntaber  uniquely  Identlflea 

a specific 

Item,  but  either  Item 

mnber  or  name  la  aufflcient  to  determine  the  price.  Aa  before,  a price 
change  affecta  a maber  of  recorda  (an  update  anomaly),  a new  type  of 
Item  cannot  be  added  to  the  data  baae  unleaa  a (poaalbly  null)  Item 
number  can  be  aaalgned  (an  Inaertlon  anomaly),  and  deleting  an  Individual 
Item  will  or  will  not  delete  all  pricing  information  about  that  type  of 
Item,  according  to  whether  It  waa  the  only  Item  of  that  type  (a  deletion 
anomaly).  The  problem  la  that,  although  PRICE  la  functionally  dependent 
on  the  entire  primary  key.  It  alao  la  functionally  dependent  on  another 
field,  NAME,  which  la  not  a candidate  key. 

The  aolutlon  la  to  divide  the  logical  record  Into  amaller  records 
auch  that  no  field  la  functionally  dependent  on  any  collection  of  fields 
which  la  not  a candidate  key.  Recorda  In  aecond  normal  form  which 
aatiafy  thla  additional  requirement  are  In  "third  normal  form"  (Codd, 
1971,  pp.  14-15)1  In  the  example,  the  reault  la: 
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I 


002  CHAIR 

003  CHAIR 

004  DESK 

005  DESK 

ITEM-PRICE  (NAME.  PRICE) 

CHAIR  $50 

DESK  $200 


Note  that  the  anonalles  have  Indeed  been  removed. 


The  advantage  of  third  normal  form  is  that  It  facilitates  the 

enforcement  of  those  consistency  constraints  which  can  be  described  by 

3 

means  of  functional  dependencies.  For  example,  the  contraint  that  PRICE 
is  functionally  dependent  on  NAME  is  easily  enforced  for  the  record  type 

ITEM-PRICE  (NAME.  PRICE). 

Any  PRICE  field  may  be  modified  without  violating  consistency;  each  NAME 
field  will  continue  to  be  associated  with  only  one  PRICE.  A new  record 
can  be  inserted  if  and  only  if  its  NAME  field  is  not  already  in  the  data 
base.  This  is  logically  and  physically  easy  to  check.  On  the  other  hand, 
the  consistency  of  the  ITEM  file  (not  in  third  normal  form)  is  much  more 
difficult  to  maintain.  If,  for  example,  the  price  of  item  number  001,  a 
chair,  should  change,  then  the  prices  of  items  002  and  003  would  also 
have  to  change.  If  a desk  were  to  be  Inserted  into  the  file,  then  its 

3.  It  has  been  argued  that  functional  dependencies  do  not  describe 
some  desirable  contraints.  Schmid,  H.  A.  and  R.  J.  Swenson,  "On  the 
Semantics  of  the  Relational  Data  Model,"  Proc.  1975  ACM-SIGMOD  Inter- 
national Conference  on  the  Management  of  Data,  ACM,  New  York,  pp.  211-223 
(1975). 
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price  vould  heve  to  be  compared  vith  the  price  of  a desk  already  in  the 
file,  %rtiich  could  be  time-consuming. 

Recall  the  observation  near  the  end  of  the  subsection  on  "Notation" 
that  a candidate  key  identifies  a real-world  object,  and  that  other 
fields  describe  that  object.  Second  normal  form  requires  that  descrip- 
tions apply  to  the  object  identified  by  the  entire  candidate  key,  rather 
than  an  object  identified  by  part  of  it.  In  the  example 

FURNITURE  (NAME,  COLOR.  PRICE), 

where  NAME  alone  determines  PRICE,  the  PRICE  field  describes  a larger 
class  of  furniture  (e.g.,  all  chairs)  than  that  identified  by  the  primary 
key  (chairs  of  a certain  color).  Third  normal  form  also  requires  that 
descriptions  apply  only  to  objects  identified  by  candidate  keys.  In  the 
example, 

ITEM  (ITEM#,  NAME,  PRICE) 

where  NAME  (as  well  as  ITEM#)  determines  the  PRICE  field,  price  again 
describes  a larger  class  of  furniture  (e.g.,  all  chairs)  than  that 
identified  by  ITENf  (a  particular  chair).  Hence,  third  normal  form  is  a 
formalization  of  a rule  which  is  intuitively  quite  reasonable. 

The  preceding  subsections  have  discussed  the  advantages  of  third 
normal  form.  The  following  subsection  presents  a technique  for  construc- 
ting logical  record  structures  from  statements  of  functional  dependence. 

SYNTHESIS  OF  RECORDS  IN  THIRD  NORMAL  FORM 

The  notation  C A was  introduced  earlier  to  indicate  that  A was 
functionally  dependent  on  a collection  of  fields  C.  Informally,  A 
describes  a real-world  object  identified  by  C.  For  example  in 

EMPLOYEE  (ID#.  SS#,NAME). 
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NAME  and  SSl  describe  the  employee  identified  by  ID#,  and  in 


SKILL  (ID#.  SKILL-CODE.  RATING) 


RATING  describes  the  employee's  skill  Identified  by  ID#  and  SKILL-CODE 


Assume,  then,  that  the  desired  relationships  of  the  fields  in  a data  base 


are  represented  by  a collection  of  functional  dependencies.  Assume  also 


that  each  left-hand  side  of  a dependency  is  as  simple  as  possible  - l.e 


no  fields  are  included  on  the  left  unless  they  are  necessary  to  Identify 


the  proper  object.  Thus,  ID#,  SS# — ^ NAME  is  true  but  can  be  Inferred 


from  the  simpler  expression,  ID# 


NAME 


A simple  and  straightforward  method  for  designing  logical  records  is 


to  associate  a record  type  with  each  functional  dependency;  that  is,  if 


C — >~k  then  a record  type,  say  R,  is  constructed  as  follows 


This  has  the  disadvantage  of  requiring  a large  number  of  record  types  and 


of  redundantly  representing  functional  dependencies.  For  example,  if 


EMP#  — ► DEPT,  EMP# — ►LOG,  and  DEPT — ►LOG,  then  the  record  types  would 


01P-DEPT  (EMP#,  DEPT) 


DEPT-LOC  (DEPT,  LOG) 


But  EMP-DEPT  and  DEPT-LOC  can  be  joined  on  DEPT  and  then  projected  onto 


EMP#  and  LOG  to  produce  a record  type  with  the  same  information  as 


EMP-LOC.  It  would  be  undesirable  for  QIP-LOC  to  be  a part  of  the  data 


base,  since  it  can  be  derived  from  EMP-DEPT  and  DEPT-LOC,  and  the  redun- 


dancy would  complicate  update  and  introduce  the  possibility  of  inconsis- 


tency in  the  data  base.  For  example,  suppose  that  the  data  base  contained 


« 


(BMP#. 

DEPT) 

001 

TOYS 

002 

TOYS 

003 

BOOKS 

004 

BOOKS 

005 

BABY 

(DEPT. 

LOC) 

TOYS 

FIRST 

BOOKS 

SECOND 

BABY 

FIRST 

(EMP#. 

LOC) 

001 

FIRST 

002 

FIRST 

003 

SECOND 

004 

SECOND 

005 

FIRST 

If  EMP-LOC  is  stored  In  the  data  base,  then  changing  the  LOC  of  TOYS  to 
SECOND  involves  changing  one  record  in  DEPT-hOC  and  two  records  in 
EMP-LOC.  The  latter  aust  be  identified  by  selecting  records  of  EMP-DEPT 
with  DEPT  equal  to  TOYS,  and  then  selecting  records  froa  EMP-LOC  with  the 
appropriate  values  (001  snd  002)  for  EHP#.  Note  that  each  of  the  three 
records  is  in  third  noraal  fora;  it  is  the  coabination  that  introduces 
update  probleas. 

The  general  fora  of  redundancy  is  staple,  as  described  below.  Let 
B,  7,  and  H be  lists  of  one  or  aore  fields,  and  G be  either  null  or  a 
list  of  one  or  aore  fields.  If  E — ^7  and  7,  G — H,  then  E,  G— *-H  is 


redundant,  since  it  can  be  Inferred  from  the  other  dependencies.  A 


special  case  Is  that  in  which  E— F — w- H,  and  hence  E — r-H  (l.e.,  G 


Is  null).  Such  redundancies  must,  then,  be  eliminated  from  the  collection 


of  functional  dependencies.  This  Involves  choosing  one  of  the  functional 


dependencies  after  another  and  discarding  It  If  It  can  be  derived  from 


the  remaining  functional  dependencies.  A reasonably  straightforward 


algorithm  Is  described  by  Bernstein,  Swenson  and  Tslchrltzls 


The  next  step  Is  to  combine  functional  dependencies  with  Identical 


left-hand  sides.  That  Is,  X— ►-Y  and  X— can  be  combined  to  form  X 


Y,Z.  Record  types  are  constructed  as  before,  with  left-hand  parts  of 


functional  dependencies  becoming  primary  keys.  Next,  records  are  combined 


If  their  primary  keys  are  dependent  on  each  other.  For  example 


can  be  simplified  to 


This  last  step  may  result  In  a record  type  which  Is  not  In  the  third 


normal  form.*  For  example 


Is  possible  and  leads  to 


4.  Bernstein,  P.  A.,  J.  R.  Swenson,  and  D.  C.  Tslchrltzls,  "A  Uni- 
fied Approach  to  Functional  Dependencies  and  Relations,"  Proc.  1975 
ACM-SIGMOD  International  Conference  on  the  Management  of  Data,  ACM,  New 
York,  pp.  237-245  (1975). 

*Thla  observation  waa  made  by  Bernstein  while  delivering  the  pre- 
viously cited  paper  at  the  1975  ACM-SIGMOD  Conference.  The  example  Is  a 
simplified  version  of  Bernstein's  example. 


and  of  course  R4  is  not  In  third  normal  form  since  E is  functionally 
dependent  o.n  A only  in  A,  B.  However,  it  suffices  to  eliminate  E from 
the  latter  record  type,  resulting  in 

R1  (A,  E) 

R4  (A,  B.  C,  D) 

In  general,  the  final  step  is  to  eliminate  such  extraneous  fields  to 
ensure  third  normal  form. 

SUMMARY  OF  LOGICAL  RECORD  DESIGN 

In  summary,  a reasonably  straightforward  procedure  exists  for 
designing  logical  records.  The  objective  is  to  begin  with  functional 
dependencies,  and  to  combine  fields  wherever  possible  without  introducing 
various  types  of  redundancies.  The  result  is  a simpler  data  base  whose 
consistency  can  be  more  efficiently  maintained  by  the  data  base  manage- 


ment system. 


DESIGN  OF  LOGICAL  ACCESS  PATHS 


The  purpose  of  a logical  access  path  la  to  provide  a means  for 
identifying  a record  or  collection  of  records.  If  every  desired  collec- 
tion could  be  anticipated  during  data  base  design  and  given  a unique 
name,  then  the  user  could  merely  provide  the  name  to  specify  the  data  he 
desired.  In  practical  situations,  however,  the  number  of  names  would  be 
far  too  large  for  either  the  user  or  the  data  base  management  system. 
Hence,  access  paths  to  complex  collections  of  records  must  be  constructed 
by  combining  simple  paths;  the  forms  of  these  simple  paths  and  the  ways 
of  combining  them  are  the  subjects  of  this  section. 

Logical  access  paths  which  are  short  and  match  the  user's  conception 
of  his  problem  are  very  desirable  for  the  user,  in  order  tha(  the  data  he 
requires  can  be  meaningfully  and  succinctly  described.  The  efficiency  of 
the  data  base  management  system  is  also  highly  dependent  upon  the  logical 
access  paths.  Paths  which  are  long  and  presented  piecemeal  to  the  data 
base  management  system  will  require  many  physical  record  accesses.  Short 
access  paths,  or  long  paths  which  are  described  in  a single  request,  are 
much  more  desirable.  Even  if  realized  by  long  physical  access  paths, 
such  logical  access  paths  allow  the  data  management  system  to  read  ahead 
and  retain  needed  records  in  faster  storage,  thus  improving  real-time 
response.  Seemingly  unrelated  requests  for  records  are  most  inefficient, 
while  the  retrieval  of  a large  well-defined  collection  can  be  optimized 
and  quite  efficient.  In  general,  the  more  difficult  it  is  to  express  a 
user's  intentions,  the  more  inefficiently  they  will  be  realized.  However, 
there  is  a price  to  be  paid  for  simplicity  in  expression:  physical 

access  paths  and  the  data  base  management  system  Itself  may  become  quite 
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complex.  This  is  not  surprising,  considering  Che  complexities  of  data 
base  management;  if  less  is  done  by  the  user,  then  more  must  be  done  by 
the  data  base  administrator  or  data  base  management  system. 

This  section  will  describe  three  different  models  of  logical  access 
paths:  1)  hierarchical,  2)  CODASYL  DBTG  owner-coupled  set  (or  simple 

network),  and  3)  relational.  The  advantages  and  disadvantages  of  each 
will  be  discussed. 

This  section  is  concerned  only  with  logical  descriptions  - or  views  - 
of  a data  base.^’^  Different  views  of  the  same  data  base  are  possible  and 
desirable,  since  they  permit  each  user  to  be  unaware  of  Chose  records, 
fields,  and  associations  which  do  not  concern  him.  Security  is  facili- 
tated by  providing  different  views  - a user  cannot  access  data  not  in  his 
> 

description  of  the  data  base.  Data  independence  is  also  facilitated  by 
providing  different  views  - the  program's  view  of  the  data  base  can 
frequently  be  held  constant  even  though  the  underlying  physical  data  base 
changes  because  of  new  applications,  hardware,  or  software.  Finally, 
on-line  access  to  the  data  base  by  non-programmers  may  be  greatly  facili- 
tated by  providing  different  views  - each  user  can  view  the  data  base 
through  names  and  associations  which  are  meaningful  to  him,  but  not 
necessarily  to  the  other  users.  However,  the  data  base  administrator 
must  define  the  different  views.  One  consideration  in  evaluating  a data 
base  management  system,  then,  should  be  the  ease  with  which  views  can  be 
defined  and  the  efficiency  with  which  they  can  be  implemented. 

5.  Boyce,  R.  F. , and  D.  D.  Chamberlin,  "Using  a Structured  English 
Query  Language  as  a Data  Definition  Facility,"  IBM  Research  Laboratory, 
San  Jose,  California  (December  10,  1973). 

6.  Chamberlin,  D.  D.,  J.  N.  Gray,  A.  L.  Tralger,  "Views,  Authori- 
zation and  Locking  in  a Relational  Data  Base  Design,"  IBM  Research 
Laboratory,  San  Jose,  California  (October  7,  1974). 
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It  is  reasonable  to  suppose  that,  in  the  near  future,  a generalized 
data  base  management  system  will  be  able  to  provide  a diversity  of  views, 
each  based  on  the  most  convenient  logical  access  paths:  hierarchical, 

network,  or  relational.  Until  then,  the  data  base  administrator  must 
content  himself  with  choosing  whichever  is  most  appropriate  to  his 
enterprise  as  a whole,  considering  not  only  immediate  requirements  but 
the  future  as  well.  This  is  a non-trivial  task,  since  each  access  model 
has  advantages  and  disadvantages  in  terms  of  ease  of  use,  efficiency,  and 


availability  in  data  base  management  systems.  In  general,  the  logical 
access  model  which  least  restricts  physical  access,  the  relational  model, 
is  easiest  to  use  and  potentially  most  efficient,  but  is  least  available 
because  of  the  complexities  involved  in  mapping  from  logical  to  efficient 
physical  structures.  The  hierarchical  access  model  introduces  physical 
access  paths  early  in  the  design  process,  but  has  the  advantages  of  being 
very  simple  to  use  in  simple  applications  and  of  being  readily  available. 
The  following  subsections  will  clarify  these  generalities. 

HIERARCHIES^ 

A sample  data  base  consisting  of  three  hierarchies  is  shown  below 

and  on  the  following  page: 

PART  ASSEMBLY 


7.  Date,  C.  J.,  "An  Introduction  to  Database  Systems",  Addison-Wesley , 
Reading,  Mass.  (1975),  pp.  137-221. 


■w 


V 


SUPPLIER 


The  rectangles  Indicate  logical  record  types;  the  name  of  each  Is  above 
the  rectangle,  and  the  field  names  are  within  the  rectangle  (e.g.,  PART 
(PARTI)',  PART  DESCRIPTION)).  An  arrow  Indicates  the  association  of  a 
single  occurrence  of  the  record  type  at  the  tall  with  a collection  of  zero 
or  more  occurrences  of  the  record  type  at  the  point  (e.g.,  one  ORDER 
contains  many  ORDER-ITEMs) . The  name  of  the  top-most  record  type  In  each 
hierarchy  will  be  used  as  the  name  of  the  hierarchy  itself.  Thus,  In  the 
PART  hierarchy,  access  paths  are  provided  from  each  part  to  all  orders 
for  It  and  to  all  assemblies  using  It.  In  the  ASSEMBLY  hierarchy,  an 
access  path  Is  provided  from  each  assembly  to  Its  component  parts.  In 
the  SUPPLIER  hierarchy,  access  paths  are  provided  from  each  supplier  to 
all  his  purchase  orders,  and  from  each  purchase  order  to  all  the  Items  on 
It.  Another  access  path  In  the  SUPPLIER  hierarchy  provides  the  additional 
Information  that  a supplier's  part  Is  actually  used  In  a particular 
assembly.  For  example.  If  assemblies  1 and  2 are  manufactured  at  widely 
separated  sites  but  both  use  part  3,  then  a supplier  of  part  3 might  be 
associated  with  assembly  1 or  assembly  2,  but  not  necessarily  with  both. 
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(This  might  be  convenient  In  tracing  defective  assemblies  back  to 
the  part  suppliers.) 

Note  the  duplication  of  Information  In  the  three  hierarchies  - the 
data  base  description  Is  concerned  only  with  logical  structure,  where 
redundancy  may  be  acceptable.  Physical  redundancy  Is  controlled  by 
putting  duplicated  fields  Into  one  hierarchy  and  pointers  to  them  Into 
the  other  hierarchies.  The  previous  data  base  could  be  realized  physi- 


cally as: 


r\ 

i' 


I 


physical  racords.  for  axaipla,  tha  S-A*F  racord  points  to  A and  P 
racords.  Tha  physical  records  ara  placed  in  whichever  hierarchy  is  most 
advantageous  for  tha  operation  of  tha  data  base.  The  ORDER#  and  SUPPLIER# 
fields  of  the  P-0  logical  record  are  physically  obtained  from  the  0 and  S 
records  respectively,  which  are  accessible  by  going  up  the  S hierarchy. 

In  thia  example,  a report  on  purchase  orders  will  be  more  efficiently 
processed  if  arranged  by  supplier  rather  than  by  part. 

The  primary  advantage  of  a hierarchical  systam  is  the  existence  of 
very  simple  and  useful  sequential  access  paths,  which  can  be  followed 
%rlthout  testing  for  record  type  or  value.  Three  basic  access  paths  are 
(1)  preorder  traversal  (e.g.,  the  first  S,  then  the  first  0,  then  each 
O-I  under  it,  then  the  n txt  0,  and  so  on  to  the  last  S-A-P  of  the  last 
S) , (2)  each  record  of  a set  below  a given  record  (e.g.,  each  0 under  a 
given  S) , and  (3)  the  trivial  path  from  a given  record  to  the  top  (e.g., 
from  an  O-I  to  the  0 to  the  S record) . 

The  only  difference  between  a hierarchy  and  a nested  collection  of 
repeating  groups  is  one  of  physical  access  - normally  the  records  of  a 
hierarchy  can  be  read  individually,  while  a record  with  repeating  groups 
is  read  all  at  once.  Thus,  processing  techniques  suitable  for  repeating 
groups  are  also  suitable  for  hierarchies.  In  particular,  searching  for  a 
particular  record  or  records  is  implemented  by  means  of  loops  nested  to 
correspond  directly  to  the  levels  of  the  hierarchy.  Production  of  a 
report  is  generally  extremely  simple,  since  the  hierarchical  structures 
of  the  report  and  data  base  can  also  correspond  directly. 

One  disadvantage  of  the  hierarchical  model  is  that  the  process  of 
physical  data  base  design  may  become  extremely  difficult,  because  of  the 


( 
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complications  involved  in  eliminating  physical  redundancy  while  providing 


for  a variety  of  logical  views.  Another  disadvantage  is  that  there  may 


be  no  clear  separation  between  the  logical  and  physical  design  of  the 


data  base  - the  reduction  of  physical  redundancy  through  the  use  of 


pointers  must  be  foreseen  at  a very  early  stage  of  logical  design 


However,  both  disadvantages  affect  just  the  data  base  administrator;  they 


do  not  affect  the  user,  who  sees  only  logical  hierarchies 


A possible  disadvantage  for  the  user  is  that  not  all  real-world  data 


can  be  conveniently  represented  in  hierarchies.  For  example,  the  query 


list  each  part  which  is  in  an  assembly  which  contains  part  number  001 


is  inherently  non-hierarchical.  A hierarchical  structure  would  be 


PART-ABOVE 


ASSEMBLY 


PART-BELOW 


where  the  arrow  from  PART-ABOVE  to  ASSEMBLY  indicates  that  a given  part 


may  be  used  in  many  assemblies,  and  the  arrow  from  ASSEMBLY  to  PART-BELOW 


indicates  that  a given  assembly  may  contain  many  parts.  PART-ABOVE  and 


PART-BELOW  contain  the  same  information,  but  must  be  given  different 


names  to  make  the  structure  look  like  a hierarchy  when  in  reality  it  is 


Assembly 


Used-In 


Contains 


Such  distortions  as  the  PART-ABOVE,  PART-BELOW  artifice  complicate  the 
user's  view  of  the  data.  Furthermore,  adding  another  logical  hierarchy 
to  the  data  base  would  Involve  additional  physical  pointers.  Resolving 
the  query  In  the  sample  data  base  Is  also  unsatisfactory,  since  it 
involves  both  the  P and  A logical  hierarchies:  the  P hierarchy  Is  used 

to  obtain  a set  of  ASSEMBLYls  associated  with  PART#  001,  then  the  A 
hierarchy  Is  searched  for  each  ASSQIBLY#  in  the  set,  and  then  each 
associated  P record  is  obtained  directly.  No  logical  access  path  can  be 
used  to  satisfy  this  query,  although  there  Is  an  appropriate  physical 
access  path.  Note  that  the  searching  involved  may  be  quite  time-consu- 
ming . 

A further  disadvantage  of  the  hierarchical  model  is  that  the  need 
for  a logical  access  path  must  frequently  be  foreseen  In  the  design 
of  the  logical  data  base.  For  example,  the  resolution  of  the  query  "list 
each  supplier  who  is  filling  orders  for  any  part  used  In  assembly  number 
1"  Involves  two  logical  hierarchies:  A to  produce  a set  of  parts  in 

assembly  number  1,  and  P to  produce  a list  of  suppliers. 

In  summary,  the  simplicity  of  a hlerarchlchal  data  base  Is  attractive 
but  considerations  of  physical  Implementation  may  greatly  complicate 
logical  design,  and  even  fairly  simple  queries.  If  unforeseen  or  non- 
hlerarchlcal , may  be  costly  and  logically  complex. 


CODASYL  DBTG  OWNER-COUPLED  SETS®*^ 

The  preceding  data  base  can  be  represented  in  the  CODASYL  DBTG  nodel 
as  shown  below: 

PA  S 


As  in  the  diagrans  of  the  hierarchies,  rectangles  correspond  to  record 
types.  An  arrow  points  from  a "set  owner"  type  to  a "set  member"  type, 
and  is  labelled  with  the  set  name.  An  A-P  record  represents  intersection 
data  - i.e.,  thr.t  the  owners  of  the  P-A-P-SET  and  A-A-P-SET  are  associated. 
Similarly,  an  S-A-P  record  represents  an  association  among  an  S,  an  A, 
and  a P record.  Note  that  the  DBTG  diagram  clearly  expresses  the  struc- 
tural relationships  involved:  P,  A,  and  S records  are  associated  in 

various  ways,  but  none  is  subordinate  or  superior  to  another.  The 
equivalent  relationships  are  much  less  clear  in  the  hierarchical  diagram. 


8.  CODASYL,  "CODASYL  Data  Base  Task  Grouf  Report,  April  1971",  ACM, 

New  York  (1971). 

9.  CODASYL,  "CODASYL  Data  Description  Language,  Journal  of  Development, 
June  1973,"  National  Bureau  of  Standards,  Washington,  D.C.  (1973). 
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The  DBTU  model  may  be  described  as  a simple  network  model  > network  '{) 

because  any  two  record  types  can  be  associated,  and  simple  because  the  ^ 

association  must  be  one-to-many . Complex  networks  allow  many-to-many  ; 

associations,  but  these  are  easily  represented  by  means  of  intersection  ] 

I 

sets  and  records  - c.g.,  P-A-P-SET,  A-P,  and  A-A-P-SET  establish  an  i 

I 

indirect  many-to-many  association  between  parts  and  assemblies.  DBTC 

t 

sets  are  called  "owner-coupled"  because  both  the  set  name  and  owner  | 

record  occurrence  are  required  to  identify  the  collection  of  members. 

That  is,  a sot  type  is  really  the  name  of  a one-to-many  association 
between  the  owner  record  type  and  the  member  record  types;  a set  occur- 
rence is  an  association  between  a specific  owner  record  and  set  of 
members. 

The  capabilities  and  limitations  of  the  DBTG  model  are  described 
below.  A set  may  contain  more  than  one  record  type  and  a record  may  be 
the  owner  of  more  that  one  set  type  (e.g.,  P owns  P-O-I-SET  and  P-A-P-SET). 

However,  a set  occurrence  can  have  only  one  owner  and  a record  occurrence 
can  be  a member  of  only  one  occurrence  of  a given  set  type  - hence,  a set 
name  alone  uniquely  determines  an  access  path  from  member  to  owner. 

Furthermore,  a record  occurrence  can  be  the  oimer  of  only  one  occurrence 
of  a given  sec  type  - hence,  a sec  name  alone  also  uniquely  determines  an 
access  path  from  owner  to  a set  of  records  related  to  it.  These  restric- 
tions also  simplify  physical  record  structure;  a record  type  could,  for 
example,  be  allocated  one  pointer  for  each  owned  set  type  and  one  pointer 
for  Che  owner  of  each  set  type  containing  it. 

Although  the  DBTG  and  hierarchical  data  bases  contain  identical 
daCa,Q there  are  important  differences  in  the  available  access  paths. 
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A hierarchical  data  baae  ia,  in  aoae  reapects,  a apecial  case  of  a DBTG 
data  baae  in  %ihich  a record  type  can  be  a member  of  only  one  set  type. 
Hence,  in  the  hierarchical  model,  the  set  name  need  not  be  specified  to 
uniquely  determine  an  owner  occurrence  (there  can  be  only  one  record 
above  any  other)  or  a set  of  members  of  a given  type  (there  can  be  only 
one  such  set  below  any  owner  record) . Compare  the  hierarchical  and  DBTG 
diagrams  and  note  that  names  are  needed  only  on  the  arrows  of  the  latter. 
However,  the  hierarchical  data  base  has  implied  access  paths  which  are 
not  present  in  the  DBTG  data  base:  namely,  the  preorder  traversals 

referred  to  earlier.  Sequential  access  to  a collection  of  records  is 
defined  for  only  two  cases  in  the  DBTG  model:  for  all  the  records  of  a 
set  and  for  all  the  records  of  a record  type.  In  both  models,  the  order 
of  records  within  a set  is  determined  at  the  time  records  are  inser- 
ted into  it.  The  place  of  Insertion  may  be  determined  by  content,  by 
time  of  insertion  (last-ln,  first-out,  or  first-in,  first-out,  for 
example),  or  by  an  arbitrary  procedural  decision  (i.e.,  a procedure  can 
find  a particular  record  occurrence  and  Insert  before  or  after  it) . 

The  locations  of  rectangles  are  immaterial  in  a DBTG  diagram  (only 
named  arrows  are  significant).  However,  locations  determine  the  logical 
and  possibly  even  the  physical  structure  of  a hierarchical  data  base, 
since  the  relative  locations  determine  the  sequence  of  records  processed 
in  the  preorder  tree  traversal. 

The  Sub-Schema  Data  Definition  Language  can  be  used  to  define 
hierarchical  substructures  of  a network  data  base,  but  there  are  no  Data 
Manipulation  Language  commands  to  manipulate  entire  hierarchies.  Some  of 
the  same  effects  may  be  obtained,  however,  by  the  use  of  automatically 
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Invoked  procedures  to,  for  example,  delete  everything  belov  selected 
records  when  those  records  are  themselves  deleted. 

In  general,  the  Increased  freedom  of  movement  allowed  In  a DBTG  data 
base  Is  at  the  cost  of  Increased  complexity  In  processing  hierarchies. 
However,  queries  such  as  "list  each  part  number  which  Is  In  an  assembly 
which  contains  part  number  001,"  mentioned  earlier,  are  much  more  easily 
and  naturally  expressed  In  the  DBTG  system  than  In  a hierarchical  system. 
The  query  could  be  satisfied  by  following  an  owner  to  member  to  owner 
path  from  PART#  001  through  A-P  records  to  each  associated  A record,  then 
following  an  owner  to  member  to  owner  path  to  each  associated  P record. 

The  entire  path  has  four  links,  two  one-to-one  and  two  one-to-many 
(requiring  two  loops).  No  searching  Is  necessary.  The  equivalent  query 
in  Che  hierarchical  model  Involved  two  one-to-many  links,  plus  a search 
for  each  of  a set  of  ASSEMBLY#s.  The  other  query  presented  earlier, 

"list  each  supplier  who  is  filling  orders  for  any  part  used  In  assembly 
number  1,"  is  somewhat  complicated  In  the  DBTG  data  base  - Its  resolution 
requires  an  access  path  from  ASSEMBLY#  1 to  the  members  of  an  A-A-P-SET, 
to  the  owner  of  a P-A-P-SET,  to  the  members  of  a P-O-I-SET,  to  the  owner 
of  an  0-0-1-SET,  and  finally  to  the  owner  of  an  S-O-SET.  Again,  as  in 
the  case  of  the  hierarchical  model,  Che  need  for  a particular  access 
path,  from  A to  S,  was  not  foreseen  In  the  logical  data  base  design. 

A particularly  important  difference  for  Che  present  discussion  Is 
that  the  DBTG  system  appears  to  more  effectively  separate  logical  and 
physical  descriptions  of  a data  base,  while  at  the  same  time  allowing 
reasonable  flexibility  In  Che  choice  of  physical  structures.  For  example. 
If  Che  user  of  a hierarchical  data  base  must  deal  with  logically  redundant 
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data,  then  performing  an  update  requires  knowledge  of  whether  the  data  la 
physically  redundant  or  not;  he  should  not  perform  updates  of  many 
logical  records  If  only  one  physical  record  Is  Involved.  For  the  OBTG 
user,  redundancy  can  be  eliminated  from  the  logical  data  base. 

In  summary,  the  DBTG  model  provides  more  flexibility  In  the  design 
of  logical  access  paths  than  Is  provided  by  the  hierarchical  model. 

Simple  paths,  such  as  a preorder  tree  traversal,  may  be  somewhat  more 
complicated  In  the  DBTG  model  than  In  the  hierarchical  model. 

N-ARY  RELATIONS^®’^^ 

The  preceding  section  entitled  "Design  of  Logical  Records"  Introduced 
the  notation  of  n-ary  relations.  The  example  of  the  previous  subsections 
becomes 

P (PART#,  PART-DESCRIPTION) 

A (ASSEMBLY#,  ASSEMBLY-DESCRIPTION) 

S (SUPPLIER#,  ADDRESS) 

0 (ORDER# , SUPPLIER#,  TOTAL-COST) 

O-I  (PART#,  ORDER#.  COST,  QUANTITY,  DELIVERY-DATE) 

A-P  (ASSEMBLY#,  PART#,  NUMBER-USED) 

S-A-P  (SUPPLIER#,  ASSEMBLY#,  PART  #) 

Records  must  be  In  first  normal  form.  The  records  In  the  data  base  are 
unordered,  although  an  order  based  on  record  content  may  be  imposed  upon 
a collection  of  records  when  It  Is  retrieved  (there  will,  of  course,  be 

10.  Codd,  E.  F.,  "A  Data  Base  Sublanguage  Founded  on  the  Relational 
Calculus,"  Proc.  1971  ACM-SIGFIDET  Workshop  on  Data  Description,  Access 
and  Control,  ACM,  New  York,  pp.  35-68  (1971). 

11.  Chamberlin,  D.  D.  and  R.  F.  Boyce,  "SEQUEL:  A Structured  English 

Query  Lanuage,"  Proc.  1974  ACM-SIGFIDET  Workshop  on  Data  Description, 
Access  and  Control,  ACM,  New  York  (1974). 
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an  order  imposed  by  physical  structure,  but  this  Is  subject  to  change 
whenever  the  data  base  is  reorganized) . Duplicate  records  are  auto- 
matically removed  from  the  data  base.  A logical  access  path  from  one 
record  type  to  another  is  established  by  the  existence  of  comparable 
fields  (l.e.,  fields  with  common  domains)  In  the  two  record  types.  For 
example,  supplier  number  is  a foreign  key  in  0 which  establishes  a 
one-to-many  association  between  S and  0: 

S (SUPPLIER#.  ADDRESS) 

1 ARLINGTON 

2 ALEXANDRIA 

3 BETHESDA 

0 (ORDER#,  SUPPLIER#,  TOTAL-COST) 


101 

1 

100 

102 

1 

20 

103 

1 

500 

lOA 

2 

50 

105 

3 

80 

106 

3 

100 

The  resolution  of  the  first  query  discussed  earlier,  "list  each  part 
number  which  Is  In  an  assembly  which  contains  part  number  001"  requires 
only  A-P  records.  The  assembly  numbers  of  all  records  containing  part 
number  001  are  used  to  Identify  A-P  records  which  have  the  desired  part 
numbers.  The  other  query,  "list  each  supplier  who  Is  filling  orders  for 
any  part  used  In  assembly  number  1,"  requires  A-P  records  to  determine 
part  numbers  used  In  assembly  number  1,  O-I  records  to  determine  order 
ntmibers  for  those  parts,  and  0 records  to  determine  the  supplier  numbers. 
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The  subsection  entitled  "Synthesis  of  Records  In  Third  Normal  Form" 
discussed  the  rules  for  designing  records  In  a relational  data  base. 

These  rules  allow  the  designer  only  a very  limited  degree  of  flexibility 
In  the  choice  of  record  structures:  all  that  he  can  do.  In  fact.  Is 

determine  the  order  In  which  redundant  functional  dependencies  are 
eliminated.  Otherwise,  the  logical  record  structures  will  be  completely 
determined  by  the  functional  dependencies.  Since  all  logical  access 
paths  are  determined  implicitly  by  the  contents  of  the  logical  records 
rather  than  by  explicit  structures,  the  functional  dependencies  determine 
logical  access  paths  as  well. 

In  the  hierarchical  model,  there  Is  a natural  access  path  (with 
order  determined  by  the  logical  data  structure,  content,  time  of  Inser- 
tion, or  an  arbitrary  procedure)  through  an  entire  hierarchy.  In  the 
DBTG  model,  there  Is  a path  (with  order  determined  by  record  content, 
time  of  Insertion,  or  an  arbitrary  procedure)  through  a set  occurrence  or 
record  type.  In  the  relational  model,  all  paths  are  determined  by  record 
content  alone;  order  Is  determined  (by  content)  at  the  time  a collection 
of  records  is  retrieved,  rather  than  at  the  time  a record  Is  Inserted. 

All  record  occurrences  In  a relational  data  base  are  unique;  identical 
content  of  two  record  occurrences  must  Indicate  redundancy,  since  the  two 
cannot  be  distinguished  by  set  associations  or  by  order.  Identical 
records  can  exist  In  a hierarchical  or  DBTG  data  base. 

Information  can  be  represented  by  record  content,  associations,  or 
order  In  a hierarchical  or  DBTG  data  base,  but  only  by  record  content  In 
a relational  data  base.  The  discipline  of  the  relational  model  has  a 
definite  advantage:  data  base  definition  and  manipulation  can  be  a 


science,  rather  than  an  art.  There  is  an  accompanying  disadvantage, 

however:  hierarchies  and  networks  often  correspond  more  closely  than  do 

relations  to  our  conceptions  of  the  real  world.  ! 

The  relational  model  has  the  further  disadvantage  that  modifying 
a candidate  key  in  a record  can  have  far-reaching  consequences,  since  all  : 

records  associated  with  It  by  means  of  a foreign  key  will  also  have  to  be 
updated.  This  follows  from  the  Interpretation  of  a candidate  key  as  the 
Identifier  of  a real-world  object;  If  the  name  of  an  object  is  changed, 
then  all  Information  concerning  It  must  refer  to  the  new  name.  Hence,  It 
IS  desirable  to  require  that  object  Identifiers  be  Inserted  or  deleted, 
but  not  changed.  This  may  not  accurately  model  the  real  world,  where 
identifiers  do  change  (women  may  change  their  names  through  marriage, 
part  numbers  are  changed,  and  so  on).  In  the  hierarchical  and  network 
models,  common  data  may  be  factored  out  of  a collection  of  records  and 
stored  In  a higher  hierarchical  level;  In  particular,  the  Identifier  of 
an  object  can  be  separated  from  the  records  describing  It,  so  It  need 
logically  be  changed  In  only  one  place. 

A significant  advantage  of  the  relational  model  Is  that  logical 
access  paths  are  effectively  unconstrained  - at  run-time  the  user  can 
establish  any  association  logically  Implied  by  the  data.  Furthermore, 
the  operations  available  In  relational  data  base  management  systems  allow 
the  expression  of  quite  complex  requests  In  single  statements,  allowing 
the  system  to  perform  various  optlml2atlons  based  on  the  existing  physical 
access  paths. 

A disadvantage  of  the  relational  model  Is  that  the  freedom  allowed 
in  constructing  logical  access  paths  demands  a data  base  management 
system  capable  of  performing  much  of  the  optimization  normally  performed 
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by  the  data  base  designer.  The  user  is  not  constrained  to  do  things 
"efficiently,"  for  example,  by  means  of  pre-deflned  sets,  so  the  system 
j must  somehow  translate  his  "Inefficient"  statements  Into  a form  compatible 

; with  physical  structures  (or  transform  the  physical  structures  Into 

I structures  more  compatible  with  the  user's  statements).  Accordingly, 

i 

relational  data  base  management  systems  are  at  present  generally  less 
available  and  less  efficient  than  hierarchical  or  DBTG  sytems. 


SUMMARY  OF  LOGICAL  ACCESS  PATH  DESIGN 

Many  applications  Involve  queries  which  are  sufficiently  ad  hoc  to 
defy  optimization  by  data  base  designers;  such  queries  are  best  handled 
by  relational  systems.  Other  applications,  althou^..  relatively  predic- 
table and  simple,  still  demand  high  efficiency;  for  these  a hierar- 
chical system  would  be  most  suitable.  Finally,  other  applications  are 
predictable,  complex,  and  require  high  efficiency;  these  would  require  a 
DBTG  system. 

What,  then,  of  the  many  real  environments.  In  which  all  these 
applications  are  mixed  with  others  which  are  unsuitable  for  any  of  the 
three  types  of  systems?  The  thesis  to  be  developed  In  the  following 
sections  is  that  data  base  design  Is  best  accomplished  If  decisions  are 
postponed  as  long  as  possible  - hence,  that  system  Is  best  which  most 
effectively  accommodates  changes  In  logical  requirements  and  physical 
Implementation.  At  present,  the  relational  model  most  easily  accom- 
modates changes  In  logical  requirements,  and  hence  Is  to  be  preferred  In 
the  early  stages  of  logical  design.  The  DBTG  model  Is  widely  available 
and  most  easily  accommodates  changes  in  physical  Implementation,  so  It  Is 
to  be  preferred  In  the  final  stages  of  physical  design. 
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DESIGN  OF  PHYSICAL  RECORDS 


An  access  path  provides  a way  to  locate  a desired  record;  this 
section  discusses  what  a record  looks  like  when  It  Is  located  and  read  or 
written.  In  other  words.  It  describes  how  Information  can  be  stored  to 
reduce  the  costs  of  mass  storage,  computer  time,  and  user  time.  Generally, 
the  emphasis  of  this  section  will  be  on  ways  of  reducing  physical  redun- 
dancy and  thereby  reducing  mass  storage  and  updating  costs,  at  some 
increase  In  computer  time  and  response  time.  The  following  section  on 
access  paths  will  consider  techniques  which  reduce  response  time  but 
generally  Increase  the  amount  of  mass  storage. 

This  section  will  present  possible  answers  to  three  basic  questions 
in  the  design  of  physical  records: 

1.  What  schemes  can  be  used  to  Indicate  the  presence  or 
absence  of  optional  fields?  (In  an  address  file,  for  example,  some 
people  have  apartment  numbers  and  some  do  not.) 

2.  Where  Is  the  value  of  a given  field?  (To  reduce  transfer 
time,  frequently  and  Infrequently  used  fields  from  one  logical  record  may 
be  in  different  physical  records.) 


The  term  "control  field"  «flll  be  used  to  denote  a field  which 
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Is  part  of  a physical  record,  but  which  Is  accessed  only  by  the  data  base 

management  system  and  not  by  the  user.  The  term  "physical  record" 

Implies  physical  contiguity.  A logical  record  may  consist  of  many 

physical  records  linked  together  by  pointers.  For  example,  binary 

12 

relations  have  been  used  to  Implement  an  n-ary  relational  system. 
Furthermore,  pointers  need  not  be  physical  addresses,  but  may  be  key 
values  or  relative  addresses. 


DETERMINATION  OF  A FIELD'S  PRESENCE  OR  ABSENCE 


At  least  six  possible  methods  exist  for  determining  whether  or  not  a 
given  field  is  actually  present  in  a record  occurrence. 

• First,  a field  may  be  known  to  be  logically  present  at  all 
times  (a  fixed  field). 

• Second,  a field  may  be  known  to  be  present  at  all  times  within 
a repeating  group. 


• Third,  the  combination  of  optional  fields  which  are  actually 
present  may  be  indicated  by  an  encoded  control  field. 

• Fourth,  the  presence  or  absence  of  an  individual  field  may 
be  indicated  by  a bit  in  a control  field. 

• Fifth,  a field  may  be  physically  present  at  all  times  but 
considered  logically  absent  if  it  has  a particular  (null)  value  (e.g.,  an 
illegal  character) . 


• Sixth,  each  field  present  may  be  identified  by  a name  or  label. 
Clearly  the  first  possibility  Involves  no  problem  of  representation.  The 


12.  Lorie,  R.  A.,  "XRM  - An  Extended  (N-ary)  Relational  Memory," 
Technical  Report  320-2096,  IBM  Scientific  Center,  Cambridge,  Mass. 
(January,  1974) . 
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second  Involves  Indicating  the  number  of  repetitions  of  a given  group, 
and  will  be  considered  In  the  subsection  on  "Determination  of  a Field's 


Location".  The  remaining  four  possibilities  are  reasonable  alternatives 
for  representing  records  with  variable  logical  contents. 

The  third  alternative,  an  encoded  control  field,  provides  a repre- 
sentation with  mathematically  minimal  storage  space.  However,  It  may 
require  encoding  and  decoding  procedures  which  are  quite  costly  In  terms 
of  both  time  and  storage.  Accordingly,  such  optimal  encoding,  described 
below,  will  be  used  primarily  to  suggest  and  evaluate  other  techniques, 
described  at  the  end  of  this  subsection. 

To  Introduce  an  optimal  encoding  scheme,  assume  that  there  are  n 
optional  fields.  There  are  therefore  2^  possible  combinations  of  fields 
present  and  absent.  Each  combination  can  be  considered  a "message"  with 
some  associated  probability,  say  . Unless  each  optional  field 

Is  present  precisely  half  the  time,  these  messages  will  occur  with  unequal 
probabilities.  Therefore,  It  is  reasonable  to  represent  the  more  probable 


messages  with  short  code  words,  and  the  less  probable  with  longer  code 

words.  The  following  discussion  merely  provides  a technique  for  optimally 

13 

assigning  code  words.  Communication  theory  can  be  applied  to  yield  the 
result  that  the  average  number  of  bits  required  to  Identify  a message 


(l.e.,  to  specify  which  fields  are  present)  Is  bounded  below  by 


numbers  of  "messages"  are  coded  together,  l.e.,  if  an  arbitrarily  large 

number  of  records  can  be  considered  as  one  super-record,  vlth  one  code 

field  Indicating  the  presence  or  absence  of  all  fields  In  all  records. 

This  Is  clearly  Infeasible,  but  simpler  encodings  will  provide  acceptable 

13  14 

results.  Huffman  coding  * will  be  described  here  and  la  also  appli- 
cable to  the  data  compression  discussed  In  the  subsection  on  "Determina- 
tion of  Field  Values". 

The  following  simple  procedure  works  by  constructing  a binary  tree, 
with  the  most  probable  messages  near  the  root  and  the  least  probable  the 
farthest  from  It. 


Huffman  Coding 


list. 


1.  Each  message,  with  associated  probability.  Is  put  onto  a 


2.  The  two  messages  with  least  probabilities,  say  m^and  m^ 


are  removed  from  the  list.  A new  message,  say  mj^,  with  the  sum  of  those 
probabilities.  Is  added  to  the  list,  and  Is  represented  as  the  following 


3.  If  there  Is  more  than  one  message  on  the  list,  go  to  2. 

4.  Otherwise,  the  result  will  be  a binary  tree.  Assign  a 0 to 
each  left  branch  and  a 1 to  each  right  branch.  A message  at  a leaf  Is  then 
Identified  uniquely  by  the  sequence  of  O's  and  I's  which  are  on  the  path 
from  the  root  to  It. 


14.  Huffman,  D.  A.,  "A  Method  for  the  Construction  of  Minimum- 
Redundancy  Codes,"  Proc.  I.R.E.,  Vol.  40,  p.  1098  (September,  1952). 
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This  procedure  can  be  proven  to  yield  an  optimal  encoding,  in  the 
sense  that  no  other  encoding  of  the  same  set  of  messages  has  a smaller 
average  number  of  bits  per  message.  The  procedure  will  be  illustrated  by 

three  fields  with  the  following  probabilities:  Pj^  ■ .9,  p^  “ .8,  p^  ■ .2. 

The  fields  are  assumed  to  be  independent.  The  messages  and  their  asso- 
ciated probabilities  are  given  below: 


■ 

.016 

(no  field  present) 

pSoi^ 

* 

.004 

(field  3 present) 

P^"oio^ 

m 

.064 

(field  2 present) 

P^"oil^ 

m 

.016 

(fields  2 and  3 present) 

P^“l00^ 

m 

.144 

(field  1 present) 

m 

.036 

(fields  1 and  3 present) 

P^”llO^ 

m 

.576 

(fields  1 and  2 present) 

p(«lll) 

m 

.144 

(fields  1,  2 and  3 present) 

The  corresponding  coding  tree  is  shorn  in  the  next  figure,  with  probabili- 


ties in  parentheses  following  the  messages.  The  messages  m^  to  m^  were 
produced  by  combining  other  messages  (e.g.,  m means  that  either 


The  following  table  gives  the  naae  of  each  nesaage,  Ita  coding,  the 
length  of  Its  coding.  Its  probability,  and  the  product  of  coding  length 
and  probability: 


^1 


n : 

rr-] 

>• 

I. 

r 

I 

l 
i. 


I . 


message 

code 

length 

P 

p X length 

'’'ooo 

0101011 

7 

.016 

.112 

"•ooi 

0101010 

7 

.004 

.028 

%10 

0100 

4 

.064 

.256 

'"oil 

010100 

6 

.016 

.096 

•"lOO 

oil 

3 

.144 

.432 

•"lOI 

01011 

5 

.036 

.180 

'"no 

1 

1 

.576 

.576 

•"in 

00 

2 

.144 

.288 

average  length  = 1.968 


Hence,  the  average  length  of  the  code  word  is  1.968  bits,  considerably 
better  than  the  three  bits  required  If  each  field  were  coded  by  a sepa- 
rate bit,  and  very  close  to  the  1.913  bits  required  for  the  theoretical 
minimum  (assuming  many  records  were  coded  together). 

Advantages  and  Disadvantages  of  Huffman  Coding 

Huffman  coding,  as  noted  earlier,  has  the  advantage  of  providing  the 

smallest  average  number  of  bits  to  represent  the  presence  or  absence  of  ^ 

3 

each  field.  The  disadvantages  are,  first,  that  the  encodings  are  variable  ’ 
length,  which  Implies  a need  for  variable  length  records,  and,  second,  a 
possibly  large  amount  of  storage  used  by  tables  In  the  decoding  process. 

The  former  disadvantage  Is  common  to  the  other  techniques  discussed 
below,  so  will  not  be  a factor  In  choosing  a technique.  The  storage 
problem  Is,  however,  quite  serious:  If  there  are  n fields,  then  there 

are  2"  messages,  each  of  which  specifies  the  presence  or  absence  of 
each  of  those  n fields.  Hence,  the  decoding  tree  consists  of  2"  leaves  of 
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1 


n bits  each.  Disregarding  Che  representation  of  the  tree  and  the  decoding 


procedure,  this  Is  n2  bits.  A practical  limit  for  such  encoding,  Chen 


Is  probably  around  10  fields  (more  than  10,000  bits).  However,  there  are 


reasonable  approximations  Co  Huffman  codings  which  are  much  less  space 


consuming.  Assume  In  the  remainder  of  this  subsection  that  Che  probabili- 


ties }f  occurrence  of  Che  fields  are  Independent 


Any  field  %)hose  probability,  p.  Is  near  .5  can  be  encoded  separately 


from  Che  other  fields  by  a single  bit.  The  number  of  bits  used  Is  1,  so 


the  difference  In  bits  between  this  encoding  and  the  optimal  encoding  Is 


bits 


The  remaining  fields  can  be  divided  Into  three  groups:  fields  with 


probability  near  0,  fields  with  probability  near  1,  and  everything  else 


If  there  are  k fields  with  near  0 probabilities,  say  p 


then  chose 


fields  which  are  present  can  be  identified  by  labels  of  [log  k]  bits  each 


where  "["  and  "] " brackets  Indicate  the  least  Integer  greater  than  or  equal 


to  the  real  number  Inside  them.  A count  field  of  [log  (k  + 1)1  bits  Indi- 


cates the  number  of  fields  present.  (An  alternative  scheme  would  be  to 


use  a special  label  to  Indicate  the  end  of  the  fields.  In  which  case  both 


It  and  the  other  labels  would  require  [log  (k  -t-  1)]  bits.)  The  average 


number  of  bits  used  Is 


The  average  number  of  bits  per  field  Is 


If  the  fields  have  equal  probabilities,  say  p,  then  this  expression  reduces 


This  function  has  local  minima  for  k of  the  form  2^-  1,  where  J la 
an  Integer.  The  following  table  shows  the  function  for  p > .1: 


k 

biti/fiald 

7 

.73 

15 

.67 

31 

.66 

63 

.70 

(Clearly,  there  Is  no  reason  to  use  k - 63;  two  groups,  each  with  k *•  31, 
would  be  superior,  even  If  one  field  had  to  be  represented  by  a separate 
control  field.)  Note  that  these  values  compare  reasonably  well  with  the 
minimum  possible  value  of  .469.  The  values  are  much  superior  to  the 
alternative  of  using  a control  bit  for  each  field.  A similar  scheme  can 
be  used  for  fields  whose  probability  of  occurrence  is  near  1:  labels  are 

used  for  those  fields  which  are  not  present. 

An  alternative  representation  for  fields  with  probability  near  1 may 
be  the  use  of  a special,  null  value,  e.g.,  0 in  a pointer  field,  -0  In 
a I's  complement  numeric  field,  or  an  Illegal  bit  configuration  In  a 
character  field.  The  expected  number  of  bits  Is  then  the  probability  of 
the  field  being  absent  times  the  field  length. 

The  remaining  group  of  fields  (with  probabilities  not  near  0,  .5, 
or  1)  may  be  subdivided  Into  small  groups,  each  of  which  Is  represented 
separately  by  Huffman  encoding.  This  procedure  greatly  simplifies  the 
encoding/decoding  trees:  If  a group  of  n fields  Is  subdivided  Into  m 

equal  groups,  the  tree  Is  reduced  from  2"  leaves  of  n bits  each  to  m 
trees  of  2"^”  leaves  of  n/m  bits  each,  or  n2”^®  bits  total.  As  noted  in 
the  example  of  three  fields,  the  encoding  can  be  very  good  even  for  a < 
small  number  of  fields.  The  critical  factor  Is  that  messages  which  arei 
combined  should  be  of  about  equal  probability,  because  they  are  dlstln- 


43 


4 


I 

i 

i 

i 

i 


s! 

' \ 

[i 

I ; 

!i  ■ 


f 

t 

1 


gulshed  by  a bit  In  the  decoding  tree,  and  a bit  represents  the  maxlnum 
Information  If  it  distinguishes  two  equally  likely  messages.  Equally 
probable  branches  are  especially  important  at  the  top  of  the  tree,  tihere 
the  probability  of  a message  occurrence  is  large.  Hence,  the  most  likely 
message  (l.e.,  the  most  likely  combination  of  fields  present  and  absent) 
should  have  a probability  of  .5  or  less.  This  condition  is  fairly  easily 
satisfied,  since  fields  with  near  0 and  near  1 probabilities  are  put  into 
separate  groups.  (For  example,  if  three  fields  each  have  probability  .8, 
then  the  most  likely  mesage  has  probability  .312). 

Summary  Of  Field  Determination 

The  foregoing  subsection  has  discussed  Huffman  coding  as  the  optimal 
technique  for  determining  fields  which  are  present  and  absent  in  a 
record.  Since  Huffman  coding  for  a large  .jmber  of  fields  involves  a 
very  large  encoding/  decoding  tree,  special  cases  are  considered: 

1.  A field  with  probability  of  occurrence  near  .5  is  determined 
by  a single  control  bit.  Independently  of  other  fields. 

2.  Fields  with  probability  of  occurrence  near  0 are  determined, 
if  present,  by  labels  In  a control  field  or  fields,  plus  a count  to 
indicate  the  number  of  fields. 

3.  Fields  with  probability  of  occurrence  near  1 are  determined, 
if  absent,  by  labels  as  above.  They  may  also  be  determined  by  null 
fields. 

A.  The  remaining  fields  are  divided  into  small  groups  (the 
criterion  for  smallness  being  that  the  most  likely  combination  of  field 
occurrences  should  have  probability  .5  or  less),  which  are  encoded 
separately  with  Huffman  coding. 
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Although  this  coding  acheac  Is  rather  complex.  It  can  be  used  to 


suggest  and  evaluate  simpler  alternatives 


DETERMINATION  OF  A FIELD'S  LOCATION 


A logical  record  may  be  represented  by  one,  or  by  a number  of  physical 


records.  In  the  latter  case,  a field's  value  may  be  obtained  either  by 


following  a pre-deflned  access  path  to  a record  containing  that  field,  or 


by  invoking  a procedure  which  computes  the  value,  probably  In  a manner 


unknown  to  the  data  base  management  system.  Conversely,  a physical 


record  may  contain  (parts  of)  many  logical  records 


Physical  Record  Corresponding  to  Logical  Record 


Consider  first  the  esse  In  whl^  all  fields  of  a logical  record 


are  in  one  physical  record.  Then,  In  order  to  extract  the  field's  value 


it  is  necessary  only  to  know  the  locations  of  the  first  and  last  bits  of 


the  field.  If  all  fields  are  of  fixed  length  and  are  always  present  In  a 


fixed  order,  the  required  information  can  be  stored  In  a data  dictionary 


If  there  are  optional  fields  or  If  the  order  may  vary,  the  data  fields  In 


a particular  record  must  be  described  In  a control  field  or  fields.  For 


example,  a label  or  name,  as  in  the  previous  subsection,  could  precede 


each  field,  or  the  order  of  the  labels  in  contiguous  control  fields  could 


Indicate  the  identity  and  order  of  the  data  fields.  A count  field  or  a 


special  label  Is  necessary  to  Indicate  how  many  fields  actually  occur 


Examples  are  shown  below  for  physical  representations  of  the  following 


logical  records 


EMP  (ID#,  NAME,  PHONE) 


001  JONES  12345 


002  SMITH 


If  the  order  and  field  length  are  fixed,  JONBS's  physical  record  sight 


be  001  JONESbbb  12345 

(A  "b"  Indicates  a blank.)  If  the  order  Is  variable,  then  the  record 
might  be 

I 001  N JONESbbb  P 12345 

where  ID#,  NAME,  and  PHONE  are  represented  here  by  characters  but  tfould 
normally  be  encoded  as,  for  example,  0,  1,  and  2,  respectively.  If  SMITH 
has  no  phone,  and  I (encoded  as  3)  Is  used  to  Indicate  end  of  record, 
then  smith's  record  might  be 

N SMITHbbb  I 002  # 

Another  representation  Is 

I N I 002  SMITHbbb 

where  # Indicates  the  end  of  the  control  fields.  Another  representation  Is 
2 I N 002  SMITHbbb 

where  the  Initial  control  field  Is  a count  of  the  number  of  fields 
present. 

If  fields  are  not  of  fixed  length,  then  It  Is  necessary  to  know  not 
only  where  a field  begins  but  also  where  It  ends.  If  the  beginning  of 
the  first  field  Is  at  a fixed  location  In  the  record,  and  the  length  of 
each  field  Is  stored  within  the  record,  then  all  the  beginnings  and 
endings  can  be  computed.  Or,  If  a pointer  to  the  end  of  each  field  is 
stored,  then  the  length  of  each  field  can  be  computed.  For  example,  with 
the  addition  of  a count  field  after  the  label  Indicating  a name,  SMITH'S 
record  can  be  represented  as 
2 I 002  N 5 SMITH 

(the  ID#  Is  of  fixed  length,  so  needs  no  count).  The  count  field  func- 
tions as  a pointer  or  offset  from  the  first  bit  of  a field  to  the  first 
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bit  of  the  next  field.  If  a field  has  aubflelda  - obviously  the  case  If 
the  field  can  be  of  variable  length  - then  It  Is  possible  that  an  other- 
wise unused  bit  pattern  In  a subfield  could  be  used  as  a flag  to  Indicate 

the  end  of  the  field.  For  example.  In  the  field  with  character  subfields, 

2 I 002  N SMITH! 

It  Is  assumed  that  "i"  will  never  occur  In  an  employee's  name.  Clearly 
the  choice  of  count  or  flag  depends  on  the  possible  length  of  the 
field.  If  a character  Is  6 bits,  then  a maximum  of  63  characters  can 
be  described  by  a count  of  the  same  6 bits. 

The  optimal  representation  for  a count  field  Is  a Huffman  code, 
where  each  "message"  Is  a field  length.  For  example,  suppose  that  the 
following  table  gives  the  length,  probability,  and  code  for  some  field: 


Length 

P 

Code 

0 

.001 

1111110 

1 

.005 

111110 

2 

.03 

1110 

3 

.2 

10 

4 

.7 

0 

5 

.05 

110 

6 

.01 

11110 

7 

.004 

1111111 

The  average  length  of  the  code  Is  1.485  bits,  as  compared  with  3 bits 
for  fixed  length  coding.  As  before,  a Huffman  code  can  be  a very  useful 
tool  for  evaluating  the  performance  of  other  schemes;  the  savings  of 
1.515  bits  might  or  might  not  be  considered  worth  the  extra  processing  In 
encoding  and  decoding. 

If  the  order  and  size  of  the  fields  are  assumed  fixed,  then  the 
presence  or  absence  of  an  optional  field  Is  Indicated  by  one  of  the 
schemes  of  the  preceding  subsection.  If  the  size  of  a field  Is  variable, 
then  a count  field  of  0 can  be  used  to  Indicate  that  the  field  Is  absent, 
as  In  the  example  above.  The  cost  of  a 0 count  field  Is  (l-p)C,  where  p 


count  field.  This  cost  can  be  compared  with  those  of  the  preceding 


schemes  to  determine  the  least  costly  representation  for  optional  fields; 


With  labelling  and  a fixed  order  of  ID#,  NAME,  and  PHONE,  the  JONES 


and  SMITH  records  are 


and 


001  N 5 JONES  P 123A5  # 

002  N 5 SMITH  # 


assuming  NAME  and  PHONE  are  both  optional  (note  that  the  # In  the  first 


recorc  s unnecessary;  since  all  optional  fields  are  present,  the  end 


» V 


of  the  record  Is  known  without  It) . With  a count  of  0 Indicating  a 
field  whi ',h  Is  absent,  JONES' s record  Is 
001  5 JONES  P 12345  # 

and  employee  003,  whose  name  Is  unknown,  has  the  record 
003  0 P 67890  # 

Other  representations  for  optional  fields  were  discussed  In  the  preceding 
subsection. 

/ 

Logical  Record  Contained  In  Many  Physical  Records 

Assume  now  that  the  logical  record  need  not  be  represented  by 
contiguous  fields.  The  possible  alternatives  and  advantages  are: 

1.  Frequently  used  or  short  fields  may  be  placed  In  a primary 
record  segment,  with  a pointer  to  less  frequently  used  or  longer  fields 


which  are  placed  In  one  or  more  secondary  segments.  Smaller  segments 

i 

I of  course,  save  transmission  time  and  buffer  space.  Eisner  and  Severance^^ 


15.  Eisner,  M.  J.,  and  D.  G.  Severance,  "Mathematical  Techniques  for 
Efficient  Record  Segmentation  In  Large  Shared  Data  Bases",  Technical 
Report  No.  261,  Department  of  Operations  Research,  Cornell  University, 
Ithaca,  New  York  (July,  1975). 
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have  devised  an  efficient  algorithmic  for  computing  optimal  or  near->op- 
tlmal  field  assignments,  given  frequency  of  access  to  each  field  by  each 
of  a population  of  users.  Access  time,  transfer  time,  the  cost  of  stor> 
age,  and  the  extra  pointer  required  In  segmentation  are  all  considered 
In  the  algorithm. 

2.  Exceptional  cases  (such  as  very  Infrequently  occurring 
fields,  or  fields  which  overflow  normal  maximum  size)  can  be  accommodated 
with  little  If  any  Impact  on  normal  cases.  If  the  pointer  Is  placed  In 
the  secondary  segment.  Bobrow  discusses  hashing  schemes  for  efficiently 
finding  the  secondary  segment  from  the  primary. 


Binary  Relations 

The  extreme  case.  In  which  all  logical  records  are  represented  by 

binary  relations,  has  four  other  advantages: 

1.  Records  may  be  of  uniform  size,  simplifying  each  record 

access  but  adding  to  the  number  of  accesses. 

2.  Only  fields  which  are  actually  needed  are  accessed  (the 
extreme  of  point  1) . 


3.  Fields  may  be  added  or  deleted  with  no  effect  on  the  rest 


of  the  data  base. 


4.  Extremely  specialized  access  paths  and  redundant  represen- 
tations may  be  used  to  provide  very  fast  response  to  complex  queries. 

The  disadvantages  of  binary  records  are  the  many  accesses  necessary 
to  retrieve  a large  logical  record,  and  the  cost  In  storage  of  linking 


records  together. 

16.  Bobrow,  D.  G.,  "A  Note  on  Hash  Linking",  CACM,  Vol.  18,  No.  7 
pp.  413-415  (July,  1975). 

17.  Martin,  J.,  "Computer  Data-Base  Organization,"  Prentice-Hall, 
Englewood  Cliffs,  New  Jersey,  p.  397  (1975). 
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Non-contiguous  fields  may  be  linked  together  in  any  of  the  ways 
discussed  in  the  following  section  on  physical  access  paths.  For  example 


1 

i 

I 

I 


001  X>NES  001  12345 


where  the  identifier  at  the  root  of  each  tree  is  the  relation  name.  hi 

Three  of  the  possible  ways  of  linking  these  two  physical  records  are:  |! 

1.  An  address  pointer  from  ID-NAME  to  ID-PHONE, 

t 

2.  An  address  pointer  from  ID-PHONE  to  ID-NAME,  || 

I "j, 

3.  The  synonym  chain,  if  the  records  are  hashed  on  ID#.  !| 

Linkages  2 and  3 have  the  advantage  chat  an  absent  phone  number  consumes 
no  space.  Furthermore,  queries  not  involving  phone  number  do  not  have  to 
determine  whether  the  field  is  present  or  absent;  the  encoding  schemes  of 
the  preceding  subsection  generally  involved  some  processing  whether  the 
optional  field  was  needed  or  not. 

The  representation  is  not  so  straightforward  when  the  primary  key 
consists  of  more  than  one  field.  For  example,  the  primary  key  of 
EMP-SKILL  (EMP#,  SKILL-CODE.  RATING) 
must  be  represented  by  two  linked  record  types: 

EMP-SKILL 

EMP  # CODE-RATING 

SKILL-CODE  RATING 


50 


or 


EMPSKILL 


EMP-CODE 


RATING 


EMP  » SKILL-CODE 

or  another  permutation  of  the  leaves.  Two  accesses  are  now  required 
to  determine  the  RATING  field  value;  one  can  selectively  retrieve  de- 
scriptive fields  only  after  retrieving  all  of  the  primary  key. 


Physical  Record  Containing  Many  Logical  Records 

The  preceding  discussion  pertained  to  fields  located  within  a 
physical  record  corresponding  to  a logical  record.  Where  possible,  it  is 
frequently  desirable  to  store  many  identical  fields  only  once  - this 
reduces  the  cost  of  storing  and  updating  the  field,  but  may  increase  the 
cost  and  time  to  retrieve  it.  For  example,  in  the  logical  record  types 
EMP  (EMP#,  NAME,  ADDRESS) 

EMP-SKILL  (EMP#.  SKILL-CODE,  RATING) 
the  EMP-SKILL  record  type  could  be  represented  as  a separate  physical 
record  or  as  a repeating  group  within  EMP.  In  the  latter  case,  EMP# 
would  be  stored  only  once  within  EMP,  and  not  within  each  repetition  of 
EMP-SKILL.  Clearly,  this  facilitates  the  processing  of  queries  such  as 
"What  are  the  skills  of  001?"  but  complicates  the  processing  of  queries 
such  as  "Who  has  skill  3?" 

For  purposes  of  physical  representation,  repeating  groups  can  be 
treated  as  individual  fields.  The  number  of  repetitions,  the  location  of 
each,  the  length  of  each,  and  the  special  case  of  no  repetitions  can  all 
be  represented  by  techniques  already  discussed  in  this  section. 
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Summary  Of  Field  Location 

The  preceding  subsection  has  considered  vsrlous  representations  of 
the  order  of  fields  within  a record,  and  of  the  starting  bit  and  the 
length  of  each  field.  Labels  may  be  used  to  Indicate  both  order  and  the 
fields  which  are  present.  A count  field,  possibly  encoded  by  a Huffman 
code,  or  an  otherwise  unused  combination  of  bits  In  a subfield,  may 
Indicate  the  length  of  a field.  A 0 length  may  Indicate  the  absence  of 
an  optional  field. 

The  advantages  and  disadvantages  of  non-contlguous  fields,  and 
binary  relations  In  particular,  have  been  discussed.  Binary  relations 
can  provide  very  fast  response  for  retrieving  a single  field  from  a 
record  Identified  by  a single  key  field,  but  are  costly  In  storage  and 
unsuitable  for  retrieving  a large  number  of  fields. 

DETERMINATION  OF  A FIELD'S  VALUE 

This  subsection  will  consider  various  techniques  for  compressing  the 
lengths  of  character  strings.  One  technique,  briefly  discussed  here, 
replaces  character  strings  by  pointers  or  Index  values.  This  may  or  may 
not  reduce  total  storage  requirements,  depending  upon  how  often  character 
strings  are  repeated.  A name  file  could  be  compressed  very  effectively, 
since  a small  number  of  first  and  last  names  are  very  common  (Martin, 
1975,  pp. 436-437) This  technique  may  Increase  processing  speeds, 
since  most  manipulations  (primarily  comparisons)  are  performed  on  short, 
fixed-length  fields.  The  cost  of  storage  may  also  be  reduced  If  the 
character  strings  can  be  placed  on  slow,  cheap  storage  with  only  pointers 
In  faster  storage. 


Two  compression  techniques,  Co  be  discussed  In  detail.  Involve 
elimination  of  repeated  characters  and  character  encoding  (Martin,  1975, 
pp.433-448) The  two  techniques  differ  In  Chat  Che  former  Is  generally 
most  effective  when  applied  to  a sorted  collection  of  fields;  the  latter 
Is  generally  limited  to  a field  or  even  a single  character  within  one 
record.  The  t«ro  techniques  can  be  used  slaultanSously  to  provide  very 
high  degrees  of  compression  on  some  types  of  data  (especially  sorted 
lists  of  names,  key  trords,  addresses  and  the  like). 

Consider  first  Che  elimination  of  repeated  characters.  Zeros  and 
blanks  frequently  appear  as  leading  or  trailing  characters  In  applications 
Involving  text  processing  or  accounting.  An  obvious  technique  Is  to 
replace  a string  of  such  characters  by  a flag,  such  as  an  otherwise 
unused  bit  or  character,  and  a count  of  Che  characters  removed. 

A much  more  effective  technique,  where  applicable.  Is  to  sort  all 


Che  occurrences  of  a given  field  (say  NAME  In  the  EMP  record  type  of  the 

preceding  subsection)  and  eliminate  repetitions  of  strings  from  one  field 

to  Che  next.  For  example.  If  a sequence  of  names  Is 

SMITH,  JOHN  J. 

SMITH,  JOHN  R. 

SMITH,  JOSEPH  A. 
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records,  which  asy  then  be  sorted  on  EMP#.  EMP#  any  obviously  be  easily 
coapressed  - e.g.,  by  recording  differences  froa  the  previous  field. 

The  critical  requireaent  is,  of  course,  that  the  field  to  be  coapressed 
be  a sort  field.  This  aay  be  prohibitively  expensive  if  that  field 
is  very  volatile.  The  following  section  on  physical  access  paths  discusses 
various  techniques  for  aaintaining  sorted  records. 

The  second  technique  is  the  encoding  of  individual  characters  or 
very  short  and  coaaon  sequences  of  characters.  For  example,  Martin 
proposes  a aodification  of  the  five-bit  Baudot  code  to  Include  three 
shifts:  letter,  nuaber,  and  control  (Martin,  1975,  pp. 440-442) . Since 

in  aany  applications  most  strings  are  either  alphabetic  or  numeric,  but 
not  both,  shifts  will  be  relatively  rare  so  that  the  average  character 
length  will  be  about  five  bits,  yet  87  characters  are  available  (i.e.,  32 
bit  combinations  per  shift  minus  3 shift  characters,  or  29,  times  three 
shifts).  As  another  example,  in  a name  file,  commonly  used  names  and 
titles  could  be  encoded  by  otherwise  unused  characters. 

The  most  effective  technique,  and  also  the  most  complex,  is  again 
that  of  Huffman  encoding.  The  modified  Baudot  coding  can  be  improved  to 
reflect  the  fact  that  alphabetic  characters  are  far  from  equally  probable 
in  ordinary  text.  Even  greater  Improvements  may  be  made  by  encoding 
pairs  or  some  triples  of  characters.  The  maximum  improvement  is  dependent 
upon  the  frequencies  of  occurrence  in  a particular  collection  of  fields, 
but  for  reasons  of  simplicity  it  may  be  preferable  for  one  encoding 
scheme  to  be  used  for  all  suitable  fields  in  a data  base.  In  this  case, 
encoding  and  decoding  could  be  done  very  economically  by  means  of  one 
microprogram  (Martin,  1975,  p.447).^^ 
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As  noted  earlier,  repeated  charactera  may  be  eliminated  first, 
and  then  the  result  encoded  by  some  technique  such  as  Huffman  coding. 

Note  further  that  measurement  of  the  frequencies  of  occurrence  of  charac- 
ters should  be  made  on  those  characters  actually  encoded  - 1.  e.,  the 
strings  after  elimination  of  repeated  characters.  In  this  case,  the 
counts  should  be  treated  separately  from  the  other  characters,  since 
their  distribution  Is  obviously  quite  different. 

In  summary,  the  elimination  of  repeated  characters  is  simple  and 
may  be  very  effective  in  some  instances.  The  elimination  of  repeated 
strings  from  sorted  fields  may  be  much  more  effective,  but  requires  the 
maintenance  of  a sort  order  and  a more  complex  process  of  encoding  and 
decoding.  Encoding  of  individual  characters  or  short  strings  by  Huffman 
coding  can  be  extremely  effective,  but  the  encoding  may  be  fairly  costly 
in  terms  of  time  and  the  required  coding  tree. 

SUMMARY  OF  PHYSICAL  RECORD  DESIGN 

This  section  has  concentrated  primarily  on  various  techniques  to 
reduce  mass  storage  requirements.  The  first  subsection  discussed  tech- 
niques for  indicating  the  presence  and  absence  of  optional  fields.  The 
second  subsection  discussed  techniques  for  determining  the  location  of 
variable-length  or  variable-order  fields  in  a record.  The  final  sub- 
section discussed  techniques  for  compressing  character  strings.  The 
unifying  technique  of  the  section  has  been  the  applications  of  Huffman 
coding  to  compress  records  and  to  evaluate  other  techniques. 

The  following  section,  on  the  "Design  of  Physical  Access  Paths", 
discusses  techniques  for  improving  the  speed  of  various  operations, 
particularly  retrieval. 
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DESIGM  OF  PHYSICAL  ACCESS  PATHS 
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This  section  consists  of  five  subsections.  The  first  four  discuss  } 

i 

these  topics:  classlflcstlon  of  the  various  types  of  physical  access  | 

paths,  examples  of  then,  the  parameters  which  apply  to  them,  and  the  | 

' i 

parameters  which  apply  to  their  utilization.  The  final  subsection  Is  a 
summary. 

The  design  of  physical  access  paths  will  be  treated  as  If  It  were 
Independent  of  the  design  of  logical  and  physical  record  structures, 
and  Independent  of  the  design  of  logical  access  paths;  that  Is,  It  Is 
assiuned  that  the  data  base  management  system  Is  able  to  perform  such 
functions  as  the  amalgamation  of  physical  segments  of  a logical  record. 

The  emphasis  will  be  primarily  on  the  efficiency  of  operation  and  only 
peripherally  on  such  topics  as  the  complexity  and  maintainability  of  the 
access  programs.  The  assumption  that  any  Investment  In  those  programs 
will  be  repaid  by  decreased  response  times  or  hardware  requirements 
Is  reasonable  only  If  the  access  program  Is  heavily  used. 
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BLOCKS 

A physical  access  path  Is  basically  a method  for  partitioning 
the  data  base  Into  smaller  and  smaller  "blocks",  where  a block  Is  either 
a record  or  a collection  of  blocks.  For  example,  a search  for  a record 
using  IBM's  Indexed  Sequential  Access  Method  (ISAM)  Involves  a search 
In  a master  Index  to  find  the  appropriate  track  of  the  cylinder  Index, 
a search  In  that  track  for  the  appropriate  cylinder,  a search  within  that 
cylinder's  Index  for  the  appropriate  track,  and  finally  a search  within 
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the  track  for  the  appropriate  record.  An  example  is  given  below  (un 
necessary  details  omitted): 


MASTER  INDEX 


One  Track 


100  200  300 

400  SOO  600 

700 1 |e00|j|900| 

|730l  |77ol,l800 


Consecutive  Tracks 


Track  Index 


Overflow 

Note  that  each  entry  in  an  index  gives  the  last  entry  in  the  block. 

For  example,  to  locate  760,  the  master  index  entry  900  is  used  (since  600 
is  less  than  760),  then  the  cylinder  index  entry  800  is  used  (700  is 
too  small  and  900  too  large),  then  the  track  index  770.  The  track  has 
overflowed,  so  the  overflow  chain  is  follwed  to  760. 

A block  is  a hierarchical  or  tree  structure  in  which  the  leaves 
consist  of  records.  The  highest  level  is  directly  accessable  by  the 
data  base  management  system;  lower  levels  are  accessable  only  through 
moveswnt  within  the  tree.  A record  may  be  a part  of  many  different 
blocks,  each  representing  a different  access  path  to  it. 

Complex  queries  - i.e.,  queries  involving  values  for  more  than  one 
field  - can  be  made  quite  efficient  by  means  of  rather  simple  block 

18.  IBM  Corporation,  "Introduction  to  IBM  Direct-Access  Storage 
Devices  and  Organization  Methods,  GC20-1694-8",  IBM,  White  Plains,  New 
York,  pp.  55-63  (1974). 
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structures.  For  exsaple,  suppose  the  data  base  consists  of  the  following 


records: 


i FURNITURE 

(CATALOG#. 

TYPE, 

COLOR) 

' Block  1 

/OOl 

TABLE 

BLACK 

1 

l002 

CHAIR 

BROWN 

Block  2 

.(003 

DESK 

RED 

1 

I 

(.004 

CHAIR 

RED 

; Block  3 

/005 

DESK 

GRAY 

(006 

TABLE 

BROWN 

A query  sight  be,  "What  is  the  CATALOG#  of  a brown  chair?"  Let  CATALOG# 's 
001  and  002,  003  and  004,  and  005  and  006  be  grouped  together  to  form 
the  blocks  shown,  and  let  Indexes  specify  the  TYPES  and  COLORS  in  each 
block.  A bit  can  be  used  to  indicate  whether  a field  value  is  contained 
in  a given  block. 


CHAIR  - 

110 

(blocks  1,2) 

DESK  - 

oil 

(blocks  2,3) 

TABLE  - 

101 

(blocks  1,3) 

BLACK  - 

100 

(block  1) 

BROWN  - 

101 

(blocks  1,3) 

GRAY  - 

001 

(block  3) 

RED 

010 

(block  2) 

A Boolean  "and"  of  CHAIR  and  BROWN  indicates  that  the  only  block  %fhlch 
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can  satisfy  the  query  is  block  1.  The  number  of  records  examined  is 
minimized  if  the  number  of  blocks  is  equal  to  the  square  root  of  the 
number  of  records  (Martin,  1975,  pp.  252*253)1^ 

Block  hierarchies  which  correspond  to  physical  hierarchies  are 


I 

1 ■, 
I ' 


particularly  important.  For  example,  each  step  in  the  hierarchy  from 
disk  track,  to  cylinder,  and  finally  to  disk  pack,  mass  storage  cylinder 


I 


> ; or  magnetic  tape,  represents  an  Increase  in  access  time  of  at  least 

I 

1 one  order  of  magnitude.  Processing  time  therefore  increases  greatly 

i whenever  a block  is  not  contained  within  one  of  these  storage  units. 

- ■ I 

: J I A reasonable  time  at  which  to  reorganize  a block  (e.g.,  to  collect  free 
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space)  Is  when  the  block  Is  about  to  overflow  one  of  these  physical 
boundaries.  However,  a block  can  sometimes  be  processed  more  quickly  by 
dividing  It  Into  sub-blocks  which  can  be  processed  Independently  - e.g., 
dividing  a block  among  different  disk  volumes  (Martin,  1975,  pp.  210- 
213).^^  It  may  also  be  quite  advantageous  to  divide  a block  according  to 
activity  of  the  sub-blocks  - relatively  Inactive  sub-blocks  can  be  placed 
together  on  the  outermost  cylinders  (which  have  slowest  access)  (Martin, 
1975,  pp.  212-214),^^  or  even  on  slower  devices.  Different  sub-blocks  may 
also  be  organized  differently,  according  to  activity  - e.g.,  the  records 
of  an  active  sub-block  could  be  structured  for  fast  retrieval  while  those 
of  an  Inactive  sub-block  could  be  structured  for  storage  efficiency. 

19 

Knuth  discusses  various  empirically  common  distributions  of  activity. 

In  summary,  then,  simple  access  paths  can  be  combined  Into  a hierarchy 
of  access  paths,  each  level  of  which  Is  a path  to  a smaller  and  more 
restricted  subset  of  the  data  base.  The  access  path  through  records 
at  a given  level  may  consist  of  one  or  more  types  of  structures  (described 
below).  The  design  problem,  then,  is  to  determine  an  appropriate  hierar- 
chical structure  and  the  parameters  at  each  level.  Detailed  time  and 

20  21 

storage  analyses  have  been  made  by  Severance  and  Yao,  and  have  been 


the  foundations  for  computer  programs  which  produce  near-optimal  physical 
access  paths.  More  specialized  programs  have  been  produced  for  the 

19.  Knuth,  D.  E.,  "The  Art  of  Computer  Programming,  Volume  3: 

Sorting  and  Searching",  Addlson-Uesley,  Reading,  Massachusetts,  pp.396- 
399  (1973). 

20.  Severance,  D.  G.,  "Some  Generalized  Modeling  Structures  for 
Use  In  Design  of  File  Organizations",  PhD  dissertation,  the  University 
of  Michigan,  Ann  Arbor  (1972). 

21.  Yao,  S.  B.,  "Evaluation  and  Optimization  of  File  Organizations 
Through  Analytic  Modeling",  Ph.D.  dissertation.  The  University  of  Michigan, 
Ann  Arbor  (1974). 
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hierarchical  IMS  aystca^^  and  the  CODASTL  DBTG  network  aodel.^^  An 

24 

analysis  by  Senko  et  al  has  been  used  as  the  basis  for  modeling  and 

25 

simulating  a coimaerclally  available  data  base  management  system. 

26 

Cardenas  has  developed  a model  and  program  to  de*^(>rmine  storage''*-*^^..^ 


costs  and  access  times  for  tree.  Inverted  file,  and  multilist  structures 
assuming  different  query  complexities  and  data  base  characteristics. 

The  following  subsection  will  use  the  block  concept  as  the  basis 
for  classifying  the  various  types  of  physical  access  paths.  The  emphasis 
Is  on  very  simple  alternatives:  "data”  vs  "structure",  "contiguity" 
vs  "chaining",  "ordering"  vs  "no  ordering",  "direct"  vs  "sequential" 
access.  The  objective  Is  not  elegance  In  the  mathematician's  sense  of 
finding  the  simplest  set  of  primitives,  but  rather  to  display  all  the 
choices  which  can  be  made.  Such  a low-level  approach  to  the  design  of 
physical  access  paths  will.  It  Is  hoped,  help  to  Isolate  and  correct 
those  choices  which  are  costly.  This  Is  Important  In  any  realistic 
environment,  where  changing  patterns  of  data  base  usage  require  continual 
re- tuning  of  the  physical  structures. 


22.  Smith,  S.  E.,  and  J.  H.  Mommens,  "Automatic  Generation  of  Physi- 
cal Data  Base  Structures",  Pro£.  1975  ACM-SIGMOD  International  Conference 
on  the  Management  of  Data,  ACM,  New  York,  pp.  157-165  (1975). 

23.  Gerrltsen,  R.,  "A  Preliminary  System  for  the  Design  of  DBTG  Data 
Structures",  The  Wharton  School,  University  of  Pennsylvania,  Philadelphia, 
(1975).  Also  published  In  CACM,  Vol.  18,  No.  10,  pp.  551-557  (October, 
1975) . 

24.  Senko,  M.  E.,  E.  B.  Altman,  M.  M.  Astrahan,  and  P.  L.  Fehder, 
"Data  Structures  and  Accessing  In  Data-Base  Systems",  IBM  Systems  J., 

Vol.  12,  No.  1,  pp.  30-93  (1973). 

25.  Schneider,  L.  S.  and  C.  R.  Spath,  "Quantitative  Data  Descrip- 
tion", Proc.  1975  ACM-SIGMOD  International  Conference  on  the  Management 
of  Data,  ACM,  New  York,  pp.  169-185  (1975). 

26.  Cardenas,  A.  F.,  "Evaluation  and  Selection  of  File  Organization  - 
A Model  and  System",  CACM,  Vol.  16,  No.  9,  pp.  540-548  (September,  1973). 
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CLASSIFYING  PHYSICAL  ACCESS  PATHS 


Data  and  Structure 

A block  may  contain  "data"  (l.e.,  representations  of  the  logical 
data  ultimately  delivered  to  or  obtained  from  a user)  and  "structure" 
(i.e.,  information,  such  as  record  addresses,  trhich  is  unknotm  to  the 
user,  and  which  directs  the  traversal  of  the  physical  access  path).  The 
structural  information  may  indicate,  at  a certain  time,  that  a block 
which  can  potentially  contain  data  is  in  fact  empty  - i.e.,  that  the 
record  contains  "free  space".  Such  records  may  be  on  the  same  physical 
access  paths  as  those  containing  data,  or  may  be  on  distinct  access 
paths. 

Pointers 

A "pointer"  from  a record  X of  block  A to  a record  Y of  block  B can 
be  physically  present  in  either  X or  Y,  or  in  a separate  structural 
block,  C.  If  the  pointer  is  in  X,  it  can  be  a physical  address,  a data 
base  key  (i.e.,  an  offset  into  a "translation  table"  of  physical  addres- 
ses), or  a bit  in  a bit  map  (l.e.,  each  record  in  B is  assigned  a posi- 
tion in  a bit  string,  so  a 1 in  Y's  position  indicates  that  X points  to 
Y) . Access  from  X to  Y Involves  no  searching.  If  the  pointer  is  really 
a "back  pointer"  in  Y,  the  same  possibilities  exist,  except  that  in  this 
case  B is  searched  to  find  the  record  (or  records)  which  point  back  to  X. 
This  can  be  of  great  value  if  A is  frequently  accessed  and  resides  on 
expensive  storage,  while  B is  Infrequently  accessed  and  resides  on 
cheaper  storage,  since  the  pointer  is  on  the  cheaper  device  and  does  not 
have  to  be  read  when  records  in  A are  read.  If  the  pointer  is  in  C,  then 
it  actually  consists  of  two  pointers  - one  to  X and  one  to  Y.  It  could 
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be  a collection  of  pairs  of  addresses  or  data  base  keys,  ordered  on  one 
or  both  blocks,  or  a ttro-dlmenslonal  bit  map.  Data  compression  techniques 
can  be  employed  on  bit  maps  or  on  collections  of  address  pairs.  In  the 

I 

remainder  of  this  section,  a pointer  Is  generally  assumed  to  be  either 
an  address  or  data  base  key  In  X.  Other  possible  Implementations  should 
be  considered  In  special  cases.  For  example,  an  encoded  bit  map  may  be  a 
very  efficient  Implementation  of  an  Index  of  records  with  a particular 
field  value. 

Contiguity  and  Chaining 

Blocks  may  be  combined  Into  higher-level  blocks  by  contiguity  (l.e., 
the  address  of  the  1-t-l-th  block  of  the  collection  Is  equal  to  the  address 
plus  the  length  of  the  1-th  block)  or  chaining  (l.e.,  the  1-th  block  of 
the  collection  points  to  the  1+1-th  block) . The  traversal  of  a physical 
access  path  Involves  two  types  of  movement:  one  from  block  to  block  of  a 
collection  until  an  appropriate  block  Is  found  (l.e.,  horizontal  tree 
traversal);  the  other  a search  within  that  block  (l.e.,  movement  down  the 
tree).  A block  may  be  constructed  using  both  contiguity  and  chaining  - 
e.g.,  the  records  of  an  ISAM  track  are  contiguous,  with  the  last  record 
possibly  beginning  a chain  of  overflow  records.  A block  may  also  consist 
of  different  types  of  sub-blocks  - the  leaves  (records)  may  occur  at 
different  levels  In  the  tree  constituting  the  sub-block. 

Ordering,  Direct  Access,  and  Sequential  Access 

If  sub-blocks  within  a block  have  known  addresses  (e.g.,  all  addres- 
ses are  known  If  the  sub-blocks  are  of  fixed  length  and  contiguous)  and  are 
ordered  according  to  a certain  field,  then  a search  for  a sub-block  with 
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a particular  value  for  that  field  can  be  performed  in  three  different 
ways.  First,  it  may  be  possible  to  directly  access  the  desired  sub-block 
- i.e. , the  field  value  is  transformed  by  a 1-to-l  function  into  the 
address  of  the  sub-block  or  some  other  sub-block  from  which  it  can  be 
reached.  Second,  it  may  be  possible  to  access  a sub-block  which  is  either 
that  desired,  or  which  can  begin  another  search  - i.e.,  the  value  is 
transformed  by  a many-to-one  function  into  the  addresses  of  the  desired 
sub-block  or  some  other  sub-block  from  which  it  can  be  reached.  Third, 
sub-blocks  can  be  accessed  sequentially  until  the  desired  sub-block  is 
found.  Access  must  be  sequential  if  the  addresses  of  sub-blocks  are 
unknown  or  if  the  sub-blocks  are  not  ordered  on  the  appropriate  field. 

The  following  subsection  will  present  examples  of  different  physical 
access  paths  which  can  be  constructed  by  various  combinations  of  dif- 
ferent kinds  of  block  structures. 

EXAMPLES  OF  PHYSICAL  ACCESS  PATHS 
Sequential 

A sequential  file  - e.g. , a tape  file  - is  a block  of  contiguous  data 
records,  for  example:  . 


table 

BLACK 

CHAIR 

BROWN 

DESK 

RED 

CHAIR 

RED 

DESK 

GRAY 

TABLE 

BROWN 

The  arrow  indicates  that  the  address  of  the  first  record  is  known. 
Ordering  may  reduce  retrieval  time  but  increases  maintenance  time. 

For  example,  retrieval  time  for  all  items  of  a certain  color  is  approxi- 
mately one-third  as  great  for  an  ordered  as  for  an  unordered  tape  file 


i 

I 

1 

1 

1 


j 

i 
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(assuming  that  color  values  are  uniformly  distributed  in  the  requests 


and  that  the  tape  can  be  read  forwards  or  backwards  from  any  point  within 


the  tape);  however,  inserting  a new  record  involves  copying  the  entire 


tape.  The  trade-off  between  retrieval  and  maintenance  costs  is  a general 


characteristic  of  physical  access  paths 


A list  is  a chained  block  of  data  records,  for  example 


TABLE 


BLACK 


CHAIR 


BROWN 


DESK 


CHAIR 


DESK 


GRAY 


TABLE 


BROWN 


This  list  is  ordered  on  color  and  unordered  on  type  of  furniture.  The 


objective  of  ordering  is  to  reduce  retrieval  time,  but  the  cost  of 


maintenance  is  much  more  moderate  than  with  a -sequential  tape  file 


Insertion  requires  only  an  additional  search  for  the  proper  predecessor 


An  index  is  an  ordered  contiguous  block  of  structure  records.  The 


diagram  below  shows  two  indexes  to  a multilist 


plus  the  data  records 


themselves 


27.  Lefkovitz,  D.,  "File  Structures  for  On-Line  Systems 
Rochelle  Park,  New  Jersey,  pp.  126-129,  157-165  (1969). 

28.  Lefkovitz,  D. , "Data  Management  for  On-Line  Systems 
Rochelle  Park,  New  Jersey,  pp.  62-64,  118-120  (1974). 
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TYPE  INDEX 


2 

CHAIR 

2 

DESK 

2 

TABLE 

DATA 


COLOR  INDEX 


TABLE 

001  BLACK 

CHAIR 

006  BROWN 

- 

DESK 

009  RED 

CHAIR 

002  RED 

DESK 

007  GRAY 

TABLE 

003  BROWN 

BLACK 

1 

BROWN 

2 

GRAY 

1 

RED 

2 

The  numbers  indicate  Che  lengths  of  Che  lists.  Each  index  entry  is  Che 
address  of  the  beginning  of  a chain  of  records  with  the  indicated  field 
value.  Note  Chat  each  chain  is  in  ascending  order  of  physical  address; 
Che  reason  for  this  will  be  discussed  in  the  following  paragraphs.  The 
maintenance  of  such  a chain  is  more  complex  than  for  an  unordered  chain. 
If  space  for  new  records  is  allocated  from  the  high  end  of  free  space, 
Chen  a new  record  will  have  an  address  lower  Chan  previously  allocated 
records,  placing  it  at  the  beginning  of  the  chains;  periodic  reorganiza- 
tion is  necessary  Co  move  new  free  space,  resulting  from  deletions,  to 
the  lower  addresses.  Otherwise,  each  chain  must  be  followed  to  find  the 
proper  place  Co  insert  Che  new  record;  hence,  inserting  into  a multilist 
of  n chains  requires  n searches  for  the  proper  locations  in  order  to 
maintain  ascending  order.  To  satisfy  a complex  query  such  as,  "What  is 
Che  CATALOG#  of  a brown  chair?",  Che  shorter  of  Che  BROWN  and  CHAIR  lists 
will  be  followed  (in  this  case,  they  are  equally  long),  checking  each 
record  for  Che  other  field  value. 

If  ascending  order  is  maintained,  retrieval  may  be  much  faster, 
since  cylinder  access  is  minimized.  A detailed  analysis  is  given  in 
the  following  paragraph. 
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Analysis  of  Time  for  Multilist  and  Sequential  Retrieval 

A multilist  organization  which  Is  not  ordered  by  ascending  physical 
address  may  actually  be  less  efficient  In  retrieval  than  a sequential 
file,  since  It  Is  possible  to  consume  more  time  accessing  a few  records 
in  random  locations  than  all  records  sequentially.  Assume,  for  example, 
that  k records  are  retrieved  from  a disk  file  scored  on  M tracks.  Assume 
further  that  Che  time  for  one  rotation  Is  R,  that  the  time  to  move  to  an 
adjacent  cylinder  Is  C^,  that  Che  time  to  move  over  j cylinders,  for  J»l, 
Is  C^,  and  that  there  are  n Cracks  per  cylinder.  Then,  Ignoring 

positioning  to  the  first  record,  Che  time  Co  retrieve  k records  by 
following  a chain  of  randomly  distributed  addresses  Is 
Tj^(k-l)  (C2+  C^m/3n  + R/2) 

Assuming  Chat  pm  Cracks  contain  at  least  one  record,  and  that  at  least 
one  record  Is  contained  on  each  cylinder,  the  time  Co  retrieve  Che  k 
records  by  following  a chain  of  ascending  addresses  is 
T^»(m-1)  Cj^/n  + (pm-1)  R 

An  equation  for  p will  now  be  derived.  Assuming  Chat  there  are  b records 
per  track,  there  are  therefore  bm  records,  of  which  k are  to  be  retrieved. 
The  probability  of  retrieving  any  given  record  is 
k/bm 

Hence,  the  probability  of  not  retrieving  a given  record  Is 
I - k/bm 

and  the  probability  of  not  retrieving  any  record  from  a Crack  Is 
(1  - k/bm)’’ 


66 


Hence, 

p“  l-(l-k/bm)'* 

- k/m  - k^(b-l)/21«^b  + k^(b-l)(b-2)/3l«V  - . . . 

For  k « m, 

r 

pm  * k 
so  that 

(m-1)  C,/n  + (k-1)  R 
A 1 

The  time  to  sequentially  retrieve  all  records  from  all  m tracks  Is 

T (m-1)  C, /n  + m R 
s 1 

For  an  IBM  2314  disk  drive,  n-20,  R~  25  ms  (milliseconds),  25  ms, 

^ 18 

50  ms,  and  .4  ms  (IBM,  1974,  p.22).  Hence,  in  milliseconds, 

T ■5'  (k-1)  (62.5  + .007m) 

K 

T,~  1.25  (m-1)  + 25  (k-1) 

A 

Tg-  26m 

The  following  table  gives  approximate  values  of  T^,  T^,  and  T^  In  seconds 
for  m-2000  tracks  and  the  Indicated  values  of  k: 


K 

Tr 

Ta 

Ts 

100 

7 

5 

52 

500 

38 

15 

52 

700 

53 

20 

52 

1000 

76 

23-27* 

52 

For  larger  values  of  k,  T begins  to  approach  T , while  T continues  to 

A 9 Iv 

grow  linearly.  For  smaller  values  of  k,  T and  T„  are  approximately 
equal.  This  analysis,  of  course.  Ignores  CPU  time  and  channel  activity, 
which  could  be  significant  for  the  sequential  organization.  However,  It 

*Thls  entry  depends  on  b,  the  number  of  records  per  track,  since  the 
condition  k«  m does  not  hold.  The  value  Is  23  for  large  b (l.e., small 
records)  and  27  for  b»l. 

# 
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does  serve  to  lllustrste  the  points  thst  ordering  by  incresslng  sddress 
Is  generally  advantageous,  and  that  sequential  organization  may  be 
reasonable  even  for  a rather  low  "hlt-ratlo”  (density  of  records  to  be 
retrieved) . 


Bounded  Multilist 

A second  level  of  Index  may  be  used  to  bound  the  lengths  of  the 
chains  in  the  multilist.  For  example, 

TYPE  INDEXES  DATA  COLOR  INDEXES 


f ^ \ f ^ N 


For  clarity,  some  pointers  are  not  shorn.  Note  that  the  TYPE  and  COLOR 
fields  are  omitted.  A complex  query  such  as  that  Introduced  earlier  In 
the  section,  "Vftiat  Is  the  CATALOG#  of  a brown  chair?",  can  be  more  effi- 
ciently satisfied  If  both  the  starting  and  ending  addresses  of  the  chairs 
are  Indicated  In  the  Index:  In  this  case  only  the  first  and  third  TYPE 
chains  need  be  examined.  Note  that  both  levels  of  Index  are  relatively 
stable  compared  with  the  data.  Insertions  and  deletions  In  the  data  will 
only  rarely  result  in  Insertions  or  deletions  In  either  Index,  so  that 
contiguous  organizations  are  appropriate.  Bounding  the  chains  may  greatly 


facilitate  maintenance  of  the  data  recorda,  alnce  a abort  chain  vlll  be 
followed  to  determine  the  proper  place  of  Inaertlon.  If  each  chain  la 
reatrlcted  to  a single  "cell"  (l.e.,  a physically  significant  block,  such 
as  a cylinder) , then  the  analysis  at  the  beginning  of  this  subsection 
applies,  and  access  time  can  be  greatly  reduced. 


Inverted  File 

If  the  multilist  chains  are  restricted  to  single  records,  the 
result  Is  an  Inverted  file.  For  example. 


The  complex  query  referred  to  earlier,  "What  Is  the  CATALOG#  of  a brown 
chair?",  can  now  be  reduced  to  a comparison  of  two  lists  (ordered  In 
this  case)  In  the  Indexes  and  one  reference  to  a data  record.  The  total 
ntxDber  of  pointers  Is  approximately  the  same  as  for  either  the  multilist 
or  bounded  multilist,  but  now  all  pointers  are  In  an  Index  block,  rather 
than  In  a data  block.  This  Is  desirable  If  some  fields  are  used  In 
more  queries  than  other  fields,  since  they  can  be  placed  on  faster 
storage  devices.  The  primary  disadvantage  of  the  Inverted  file  Is  that 
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■alnccnmc*  is  costly  - ssch  Indsx  Bust  bs  updstsd  svery  tlas  s record  Is 
updated.  In  the  case  of  the  bounded  aultillst,  update  could  easily  be 
deferred,  since  the  only  effect  of  a chain  growing  beyond  the  optimal 
size  is  a snail  degradation  in  performance. 


TRIE  and  TREE 

Tree-structured  blocks  can  be  either  static  (i.e.,  the  number 
of  levels  is  fixed,  although  the  contents  of  the  nodes  nay  vary)  or 
dynamic  (I.e.,  the  shape  of  the  tree  varies  with  Insertions  and  deletions) 
Static  trees  are  essentially  hierarchies  of  Indexes,  as  In  the  ISAH 
example  presented  earlier.  Here  Is  another  example: 


CHAIR 


TABLE 


BROWN  T RED 


IciiRlIlEITIIII 


Note  that  the  upper  index  contains  all  possible  values  for  TYPE  (some  of 
the  pointers  could,  however,  be  null),  vtille  the  lower  index  contains  only 
those  COLOR  values  which  correspond  to  ATALOGl's.  These  will  be  referred 
to  as  TRIE^^  (Pronounced  "try”)  and  TREE^®’  structures,  respectively. 

The  decision  of  when  to  use  TRIE  and  ifhen  to  use  TREE  is  based  on  storage 
costs  (the  TREE  Is  a compressed  TRIE,  so  the  analysis  of  the  section  on 
Design  of  Physical  Records  is  appllcabls),  and  on  retrieval  and  maintenance 
costs.  Note  that  the  TRIE  does  not  gro<;,  so  should  be  ordered  and  con- 

29.  Fredkin,  E.,  "TRIE  Memory",  CACM,  Vol.  3,  pp.  490-499  (September 
I960). 

30.  de  la  Brlandals,  R.,  "File  Searching  Using  Variable  Length  Keys", 
Proc.  1959  Western  Joint  Computer  Conference,  IEEE,  New  York,  pp.  295-298 
(1959). 

31.  Se'fevance,  D.  G.,  "Identifier  Search  Mechanisms:  A Survey  and 
General  Modei",  ACM  Computing  Surveys,  Vol.  6,  No.  3,  p.  186  (September 
1974). 
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tlguous  to  permit  direct  access.  The  TREE  can  grow  (or  shrink),  so 
that  Its  organization  can  be  either  contiguous  or  chained,  or  a mixture, 
and  either  ordered  or  unordered,  depending  on  the  degree  and  type  of 
maintenance  activity.  The  TRIE  organization  Is  preferred  when  field 
values  are  dense.  For  example,  the  first  letters  of  names  In  a large 
telephone  directory  are  dense  - all  letters  occur  at  least  once.  However, 
the  TREE  organization  Is  preferred  when  field  values  are  sparse.  For 
example,  the  fifth  letters  of  names  In  a telephone  directory  are  extremely 
sparse  - very  few  different  letters  can  occur  after  four  letters  have 
already  occurred.  Severance^^’^^  has  developed  a generalized  character- 
string search  model,  described  further  at  the  end  of  this  subsection, 
which  exploits  this  rather  common  phenomenon.  The  first  steps  In  his 
generalized  search  are  two  prefix  searchs  - l.e.,  a search  on  the  first  1 
characters  (a  TRIE  since  the  density  Is  high)  followed  by  a search  on  the 
next  j characters  (a  TREE  since  the  density  Is  much  lower)  - follovfed  by 
a search  of  a (hopefully)  small  array  or  list  to  match  any  remaining 
characters.  Consider  the  example  of  a set  of  colors,  say  BEIGE,  BLACK, 
BLUE,  BROWN, ... ,RED, ... ,YELLOW.  The  density  of  first  letters  Is  rather 
high,  so  It  can  be  represented  by  a one-letter  TRIE.  The  density  of 
second  letters  Is  much  lower,  but  still  substantial,  so  It  can  be 
represented  by  a one-letter  TREE.  After  the  second  letter,  however, 
there  Is  generally  only  one  color,  so  a list  Is  suitable  for  handling 
synonyms  from  the  TREE.  The  following  diagram  shows  the  result,  where 
the  letters  are  above  the  TRIE  records,  to  Indicate  that,  since  all 
characters  are  present,  they  are  Implicit  and  need  no  storage  space. 
Pointer  spaces  which  sre  empty  indicate  that  no  entry  exists  - e.g.,  no 
color  (in  this  set)  begine  trith  A. 
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LLOW 


Note  the  decisions  tihlch  have  been  made:  one  character  In  the  TRIE 


search,  which  maps  Into  a TREE,  one  character  In  the  TREE  search,  con 


tigulty  rather  than  chaining  In  the  TREE  search,  and  chaining  In  the 


final  list 


Search  trees  (Knuth,  1973 


are  tree-structured  blocks 


which  can  grow  vertically,  for  example,  the  following  unbalanced  binary 


Unless  otherwise  noted,  all  search  trees  to  be  discussed  are  binary 
The  circled  numbers  Indicate  the  order  of  Insertion  Into  the  tree. 


Given  a value,  say  007,  search  begins  at  the  root.  If  the  value  Is 


not  at  that  node,  search  proceeds  with  the  left  or  right  subtree  as 
the  value  at  the  node  is  less  than  or  greater  than  the  search  value 


1 

i 


i 

\ 

\ 

{ 

i 

Hence,  the  sequence  of  nodes  visited  would  be  001,  005,  012,  and  finally  ! 

007.  insertion  Involves  an  (unsuccessful)  search  followed  by  linkage  of  | 

i 

the  new  record  to  the  left  or  right  of  the  last  node  reached  - e.g.,  017  j 

would  be  to  the  right  of  016.  Deletion  Is  simple  If  the  node  to  be  deleted 

Is  a leaf  or  has  only  one  descendant,  but  Is  somewhat  complicated  If  ^ 

19 

the  deleted  node  has  two  descendants  (Knuth,  1973,  pp.  428-429).  In 
the  latter  case  deletion  Involves  replacement  of  the  node  by  the  largest 
node  In  Its  left  subtree  or  the  smallest  node  In  Its  right  subtree.  For 
example,  012  can  be  deleted  by  replacing  It  with  either  Oil  or  013.  Node 
Oil  cannot  have  a right  descendant,  since  that  would  be  greater  than  011, 
and  Oil  was  chosen  as  the  largest  In  the  left  subtree  of  012.  Similarly, 

013  cannot  have  a left  descendant.  After  replacing  012,  the  old  Oil  or 

013  node  Is  removed,  and  Its  subtree  (If  any)  linked  Into  the  tree.  The 

result  of  replacing  012  by  013  would  be  as  follows,  where  unchanged  | 

subtrees  have  not  been  shown: 


Note  that  the  search  tree  Is  poorly  balanced  - l.e.,  some  paths  from 
root  to  leaf  are  much  longer  than  others.  Search  of  a perfectly  balanced 
binary  tree  of  2 -1  nodes  (K  levels)  requires  access  to  about  K nodes. 

V 

However,  (2  -l)/2  nodes  are  required  to  search  a tree  In  which  there  Is  a 
single  leaf  and  all  other  nodes  have  one  descendant,  as  In  this  tree: 
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Knuth  (1973,  pp.  426-^27}  shows  Chat,  for  randon  Insertions,  the  ex- 
pected number  of  nodes  accessed  in  a search  Is  about  1.4K,  which  may  or 
may  not  be  acceptably  close  to  the  minimum,  K.  However,  Insertions  come 
In  order  In  many  practical  cases.  For  example.  Insertion  of  new  part 
number,  say  001,  may  be  followed  by  the  Insertion,  In  order,  of  OOlA, 
OOlfi,  ...,  OOIN  - the  result  will  be  a list  of  maximum  length  14  and 
average  search  length  7,  rather  than  a balanced  tree  with  maximum  4 and 
average  search  length  2.  If  the  search  cost  Is  determined  by  the  search 
length,  this  degree  of  Imbalance  could  be  unacceptable.  However,  If  the 
tree  Is  on  a perlpherlal  storage  device,  then  nodes  OOlA,  OOIB,  ...,  OOIN 
would  probably  be  placed  In  nearby  locations,  since  they  were  Inserted  In 
sequence.  Proximity  depends  on  the  allocation  of  free  storage,  to  be 
considered  In  the  next  subsection.  Proximity  would  tend  to  make  the 
search  fast  and  efficient  In  the  utilization  of  Input-output  channels, 
since  the  number  of  track  and  cylinder  accesses  and  the  number  of  records 
transferred  are  more  significant  than  the  number  of  nodes  examined. 

If  a tree  Is  to  be  maintained  In  approximate  balance,  there  are 
three  alternatives.  First,  Insertions,  modifications,  and  deletions 
can  be  collected  Into  batches  and  ordered  (e.g.,  randomly,  to  try  to 
eliminate  long  lists)  before  maintenance  Is  performed.  Second,  a 
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reasonable  precaution,  even  given  the  first  alternative,  la  to  period- 
Ically  reorganize  the  data  base  (e.g.,  when  It  Is  not  available  to  on-  | 
line  users).  Third,  the  tree  may  be  examined  during  each  Insertion, 
deletion,  or  modification  and  reorganized  If  necessary  to  ensure  | 

approximate  balance. 

The  last  alternative,  maintenance  of  approximate  balance,  has  | 

received  a great  deal  of  attention.  Nlevergelt,  In  a survey  of  the 

32  ' 

field,  classifies  approximately  balanced  trees  as  either  "height- 

balanced"  or  "weight-balanced". 

A height-balanced  tree,  as  exemplified  by  AVL  trees*,  satisfies  | 

the  condition  that,  at  any  node,  the  heights  of  the  left  and  right  ' 

subtrees  differ  by  at  most  some  constant  (for  example,  1 for  AVL  trees).  i 

Thus,  the  following  Is  a height-balanced  tree:  | 


Search  of  a height-balanced  tree  is  slightly  longer  than  that  of  a 
' completely  balanced  tree,  but  reorganization  is  much  simpler.  The 

i maximum  search  length  in  a completely  balanced  tree  of  n nodes  is 

» 

log2<n  +■  1),  and  in  a height-balanced  tree  Is  1.A4  log^Cn  -f  1);  the 

32 

average  search  lengths  differ  by  a constant.  More  general  height- 

32.  Nlevergelt,  J.,  "Binary  Search  Trees  and  File  Organization", 
I - ACM  Computing  Surveys,  Vol.  6,  No.  3,  pp.  195-207  (September 

I 1974). 

*So-called  after  the  originators  Adel'son-Vel'skil  and  Landis. 
Described  In  Knuth,  1973,  pp.  451-468. 
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balanced  treea  are  described  by  Foster.  A perfectly  height-balanced 
but  non-binary  "B-tree”  Is  described  by  Bayer;  this  tree  has  the  Inter- 


esting property  that  It  grows  horizontally  as  long  as  branches  are 
available,  and  then  grows  a new  root^^*^^*^^  (Knuth,  1973,  pp.  473-479).^' 

A "weight-balanced"  tree,  as  exemplified  by  the  BB-tree  of  Nlever- 

37 

gelt,  has  the  property  that,  at  any  node,  the  ratio  of  the  number  of 
leaves  In  the  left  and  right  subtrees  Is  bounded  above  and  below.  The 
tree  can  be  more  or  less  balanced,  depending  on  the  bounds,  and  thus  pro- 
vide any  desired  compromise  between  retrieval  and  maintenance  speeds. 
Maintenance  of  weight  balance  seems  more  efficient  than  that  of  height 
balance  (though  It  may  Involve  more  operations),  since  weight  imbalance 
can  be  corrected  at  the  node  at  which  It  Is  detected:  height  Imbalance 

at  the  root  may  not  be  detected  until  a leaf  Is  reached.  Hence,  correc- 
tion of  height  Imbalance  Involves  a separate  bottom-up  pass,  which  could 
be  time  consuming  for  a tree  on  peripheral  storage. 

In  general,  retrieval  and  maintenance  operations  on  approximately 
balanced  trees  of  n nodes  require  access  to  a number  of  records  which 


Is  on  the  order  of  log  n.  Unbalanced  trees  may  require  access  to  a 


33.  Foster,  C.  C.,  "A  Generalization  of  AVL  Trees",  CACM,  Vol.  16, 
No.  8,  pp.  513-517  (August  1973). 

34.  Bayer,  R.  "Binary  B-trees  for  Virtual  Memory",  Proc.  1971 
ACM-SIGFIDET  Workshop  on  Data  Description,  Access  and  Control,  ACM,  New 
York,  pp.  219-235  (1971). 

35.  Bayer,  R.,  "Symmetric  Binary  B-trees:  Data  Structure  and  Main- 
tenance Algorithms",  Acta  Informatlca,  Vol.  1,  No.  4,  pp.  290-306  (1972). 

36.  Bayer,  R.,  and  E.  M.  McCrelght,  "Organization  and  Maintenance 
of  Large  Ordered  Indexes",  Acta  Informatlca,  Vol.  1,  No.  3,  pp.  173-189 
(1972). 

37.  Nlevergelt,  J.,  and  E.  M.  Relngold,  "Binary  Search  Trees  of 
Bounded  Balance",  SIAM  J.  Computing,  Vol.  2,  No.  1,  pp.  33-43  (March  1973) 
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maximum  number  of  records  on  the  order  of  n,  so  might  appear  leas  attrac- 
tive. However,  efficient  maintenance  of  balance  requires  extra  storage 
for  data  about  the  relative  balance.  This  extra  storage  reduces  the 
number  of  nodes  which  can  be  read  in  one  access  from  a peripheral  device, 

and  hence  Increases  the  number  of  accesses. 

38 

Held  and  Stonebraker  argue  that  static  trees  with  overflow,  such  as 
ISAM,  may  be  preferable  to  B-trees  since  they  require  fewer  pointers  and 
hence  fewer  accesses.  An  example  demonstrates  that  periodic  reorgani- 
zation costs  less  than  B-tree  maintenance.  However,  the  greater  cost  of 
maintenance  may  not  be  as  Important  as  a guarantee  that  the  average  or 
maximum  retrieval  time  will  not  exceed  a certain  value;  some  form  of 
balanced  tree  would  be  required  to  provide  such  a guarantee. 


Hashing 

^ Hashing,  or  key  transformation,  is  a technique  for  mapping  a large 
collection  of  possible  record  identifiers  onto  a much  smaller  number  of 
addresses  of  "buckets". Each  bucket  consists  of  a fixed  number  of 
"slots",  and  each  slot  can  hold  one  record.  For  example,  hashing  is 
frequently  used  in  building  symbol  tables  for  compilers  or  assemblers. 

The  number  of  possible  variable  names  is  extremely  large  - there  are 
26  X 36^  possible  six-character  names  in  FORTRAN,  for  example  - but  the 


38.  Held,  G.,  and  M.  Stonebraker,  "B-trees  Re-examined",  Memorandum 
No.  ERL-M528,  Electronics  Research  Laboratory,  University  of  California, 
Berkeley  (2  July  1975). 

39.  Maurer,  W.  D.,  and  T.  G.  Lewis,  "Hash  Table  Methods",  ACM 
Computing  Surveys,  Vol.  7,  No.  1,  pp.  6-19  (March  1975). 

40.  Peterson,  W.  W.,  "Addressing  for  Random-Access  Storage",  IBM  J. 
Research  and  Development,  Vol.  1,  No.  2,  pp.  130-146  (April  1975). 
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number  of  actual  names  In  even  a very  large  program  Is  comparatively 
small  - say  a few  hundred  or  thousand.  If  a function  can  be  found  which 
maps  the  actual  names  "uniformly"  Into  the  set  of  bucket  addresses  - 
l.e.,  the  number  of  names  mapped  Into  each  bucket  is  approximately  the 
same  for  all  buckets  - then  any  particular  name  can  be  found  In  about  one 
access.  Some  provision  must  be  made  for  handling  "collisions"  - too  many 
records  mapping  into  one  bucket,  causing  "overflow."  Hence,  insertion  of 
a record  Involves  two  processes:  mapping  the  record  into  a "home" 

bucket,  and.  If  that  bucket  has  no  free  slot,  "probing"  (following  a 
sequence  of  overflow  addresses)  until  free  space  Is  found. 

The  following  paragraphs  discuss  different  mapping  functions,  methods 
for  handling  overflow,  and  the  advantages  of  different  bucket  sizes. 

Note  that  a sequence  of  overflow  records  is  really  a (hopefully  small) 
block  - hence,  design  of  an  overflow  record  Is  really  another  problem  In 
physical  access  path  design.  Two  subproblems  are  Involved:  one  to  deter- 

mine a free  storage  area,  and  the  other  to  link  the  overflow  records 
together.  A bucket  is,  of  course,  a level  in  the  hierarchy  of  blocks  - 
the  organization  of  records  within  a bucket  must  be  determined. 

i 

The  mapping  function  from  the  set  of  identifiers  to  the  set  of 
bucket  addresses  is  clearly  quite  important  - a bad  function  will  cause  a 
large  number  of  collisions,  with  long  overflow  sequences.  Fortunately,  a 
very  simple  function  - division  hashing,  or  using  the  identifier  modulo 
the  number  of  buckets  - proved  to  be  generally  quite  good  in  real  appli- 
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cations  studied  by  Lum,  Yuen,  and  Dodd.  * Division  hashing  was.  In 

fact,  better  than  a randomizing  function.  That  Is,  a random  assignment  of 

addresses  to  Identifiers  does  not  exploit  the  regularities  In  real  data, 

43 

which  are  exploited,  to  some  degree,  by  division  hashing.  Buchholtz 
and  Knuth  (1973,  pp.  508-513,  521-524,  542-543)^^  also  discuss  other 
mapping  functions  which  may  be  useful  In  special  situations. 

The  two  most  common  methods  of  overflow  sequencing  are  "open"  and 
"chained".  An  open  sequence  Is  generated  by  a fixed  series  of  modlflca- 
tlons  to  the  original  hashing  function  - e.  g..  In  linear  probing  the 
n+l-th  address  to  be  examined  Is  simply  the  first  plus  a multiple  of 
n,  modulo  the  size  of  the  prime  area.  Open  overflow  has  the  advantages 
of  simplicity  and  storage  economy.  However,  If  the  previous  simple 
sequence  of  overflow  addresses  Is  generated,  overflow  sequences  may 
become  quite  long  through  a process  of  "clustering"  of  sequences  of 
different  "synonyms"  (records  mapping  Into  the  same  prime  bucket).  For 
example,  using  division  hashing  with  a prime  area  size  of  7 and  a bucket 
size  of  1,  and  linear  probing  with  constant  1,  the  Insertion  sequence  1, 
8,  2,  9 produces  the  following  result: 


(Bucket  numbers  are  shown  above  the  buckets.)  Inserting  15  Involves 


41.  Lum,  V.  Y.,  P.  S.  T.  Yuen,  and  M.  Dodd,  "Key-to-Address  Trans- 
form Techniques:  A Fundamental  Performance  Study  on  Large  Existing 
Formatted  Files",  CACM,  Vol.  14,  No.  4,  pp.  228-239  (April  1971). 

42.  Lum,  V.  Y.,  and  P.  S.  T.  Yuen,  "Additional  Results  on  Key-to- 

Address  Transform  Techniques:  A Fundamental  Performance  Study  on  Large 

Existing  Formatted  Files",  CACM,  Vol.  15,  No.  11,  pp.  996-997  (November 
1972) . 

43.  Buchholz,  W.,  "File  Organization  and  Addressing",  IBM  Systems  J., 
Vol.  2,  pp.  86-111  (June  1963). 
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following  the  bucket  sequence:  1,  2,  3,  4 and  finally  insertion  in 


bucket  5.  Note  that  1 and  8 are  synonyms,  2 and  9 are  synonyms,  and  the 


sequences  of  synonyms  have  run  together  to  form  a i^ong  clustered  overflow 


sequence.  This  phenomenon  can  be  avoided  by  a more  complex  overflow 


sequence.  For  example,  in  quadratic  probing 


the  sequence  of  synonyms  is  the  first  address  plus  a quadratic  function 


modulo  the  size  of  the  prime  area.  The  previous 


example  would  be 


and  15  would  be  Inserted  in  bucket  5 after  following  the  bucket  sequence 


1,  2.  The  overflow  sequences  no  longer  coalesce.  A possible  disadvantage 


is  that  only  one-half  the  bucket  addresses  will  be  generated  before  the 


sequence  repeats,  given  that  the  number  of  buckets  is  a prime.  This  may 


not  be  significant,  since  the  overflow  sequence  is  then  so  long  that 


reorganization  is  necessary.  Techniques  have  been  developed,  however,  for 


searching  the  entire  table  (Radke,  1970,  p.  104) 


Of  particular 


Importance  in  many  applications  are  numbers  of  buckets  which  are  powers 


of  two;  full-table  searching  is  possible  with  appropriate  choice  of  the 


44.  Maurer,  W.  D.,  "An  Improved  Hash  Code  for  Scatter  Storage", 
CACM,  Vol.  11,  No.  1,  pp.  35-38  (January  1968). 

45.  Radke,  C.  E.,  "The  Use  of  Quadratic  Residue  Research",  CACM 
Vol.  13,  No.  2,  pp.  103-105  (February  1970). 

46.  Day,  A.  C.,  "Full  Table  Quadratic  Quotient  Searching",  CACM 
Vol.  13,  No.  8,  pp.  481-482  (August  1970). 


quadratic.^^’^®’^^  "Secondary  clustering",  the  phenonenon  by  which 
actual  synonyms  map  Into  the  same  sequences  of  overflow  addresses,  can 
also  be  eliminated  by  making  each  new  overflow  address  depend  on  the 
identifier,  rather  than  on  Its  hashed  value. 

The  average  number  of  buckets  accessed  to  locate  a record  using  open 
overflow  with  various  bucket  sizes  Is  given  below  for  various  "loading 
factors"  (the  number  of  records  divided  by  the  number  of  slots)  (Knuth, 
1973,  p.  535):^^ 


Bucket 


Loading  Factor 


Size 

.70 

.80 

.90 

.95 

1 

2.2 

3.0 

5.5 

10.5 

2 

1.5 

1.9 

3.1 

5.6 

5 

1.1 

1.3 

1.8 

2.7 

10 

1.04 

1.1 

1.3 

1.8 

20 

1.01 

1.04 

1.1 

1.4 

50 

1.001 

1.005 

1.04 

1.1 

Note  that  the  numbers  are  for  bucket  accesses,  not  record  accesses  - hence, 
the  advantage  of  large  bucket  size  'Is  not  as  great  as  would  appear  from  the 
table.  For  example.  If  20  records  are  read  In  one  disk  access,  say  a page. 
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then  a bucket  size  of  1 gives  the  same  performance  as  bucket  size  of  20 
If  the  overflow  sequence  first  generates  all  slots  on  the  page.  Records 
should  be  small,  so  that  many  can  be  read  simultaneously.  Hashing  with 
open  overflow  is  therefore  suitable  for  Indexes,  but  not  for  large  records, 
unless  storage  Is  sufficiently  cheap  that  a low  loading  factor  Is  feasible. 

A chained  overflow  sequence  Is  an  ordinary  chain  of  buckets  In  either 
the  "prime"  area  - that  addressed  by  the  hashing  function  - or  a separate 
overflow  area.  In  the  former  case,  overflow  records  can  be  moved  about  so 
that  each  chain  consists  only  of  records  mapping  Into  a single  home  bucket. 

For  example.  If  division  hashing  Is  used  with  7 buckets  of  size  1,  then  the  . 
result  after  Inserting  1 and  8,  with  linear  probing.  Is 


0 1 

2 

3 4 

5 

6 

1 > t 1 

f 1 

1 1 

1 1 

1 

L 

and  after  Inserting  2 is 
0 1 

J 

2 

3 4 

5 

6 

1 1 TTT 

_JJ 

8 1 

111 

I 1 

c 

The  effect  Is  to  avoid  clustering 

J 

- overflow 

from  one 

bucket  causing,  or 

Increasing,  overflow  from 

another 

bucket,  as 

discussed 

earlier.  Clustering 

will  also  be  avoided  when  using  a separate  overflow  area  If  the  bucket  size 
is  1,  or  by  restricting  each  overflow  bucket  to  be  chained  to  a single  home 
bucket,  or  by  moving  records,  as  above. 

The  following  table  gives  the  average  number  of  buckets  accessed  to 
locate  a record,  assuming  chained  overflow  to  a separate  area  with  bucket  i 

size  1 (Knuth,  1973,  p.  535):^^ 
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Bucket 

Size 

Loading  Factor  | 

.70 

1 1 

IHKitJHi 

.95 

1 

1.4 

mmmm 

1.5 

1.5 

2 

1.2 

1.4 

1.4 

5 

1.1 

1.3 

1.3 

10 

1.06 

1.1 

1.2 

13 

20 

1.02 

1.06 

1.2 

1.2 

50 

1.001 

1.02  1 

1.1 

1.2 

Severance  and  Duhne,^^  and  van  der  present  detailed 

analyses  of  the  effects  of  different  overflow  techniques,  bucket  sizes, 

and  loading  factors  on  over-all  system  performance.  As  noted  earlier, 

the  open  overflow  technique  Is  generally  suitable  only  for  small  records. 

This  technique  Is  quite  sensitive  to  the  loading  factor,  and  therefore 

Is  not  suitable  If  the  number  of  records  varies  greatly.  It  does  have 

the  advantage  that  overflow  records  tend  to  cluster  close  to  the  home 

bucket,  thereby  reducing  track  and  cylinder  accesses.  The  chained 

technique  with  separate  overflow  Is  suitable  for  large  records  or  large 

buckets,  since  In  either  case  the  overhead  of  the  chaining  pointer 

is  negligible.  This  technique  Is  also  relatively  insensitive  to  the 

loading  factor,  so  Is  suitable  for  volatile  files.  Severance  and  Duhne 
54 

(1974,  p.l3)  observe  that  separate  overflow  buckets  should  be  of  size  1 
unless  speed  of  retrieval  Is  much  more  significant  than  storage  costs. 
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Note  that  hashing  techniques  do  not.  In  general,  preserve  order. 

Hence,  if  a large  collection  of  records  is  to  be  retrieved  and  sorted, 
the  collection  Itself  should  be  ordered  (e.g.,  in  a sequential  file) 
and  only  the  index  (or  Indexes)  should  be  in  a hashed-address  table. 

Generalized  Models  and  Summary  of  Examples 

This  subsection  has  defined  various  alternative  physical  access 
paths,  in  order  to  define  commonly  used  structures,  to  Illustrate  some 
of  the  complexities  which  can  be  otained  from  combinations  of  a few 
simple ' Ideas,  and  to  motivate  the  following  subsections.  The  last 
three  examples  - TRIE-TREE,  search  trees,  and  hashing  - present  different 
compromises  among  retrieval  and  maintenance  speed,  storage  efficiency, 
and  reorganization.  Hashing  with  open  overflow  can  be  very  efficient, 
but  Is  quite  sensitive  to  loading  factor  and  patterns  In  the  data  - 
periodic  reorganization  Is  necessary  to  avoid  long  sequences  of  synonyms, 
given  a reasonably  high  loading  factor.  Chained  overflow  Is  less  sensi- 
tive to  loading  factor,  but  chains  caused  by  data  patterns  may  be  less 
localized  on  peripheral  storage,  and  so  require  more  accesses.  The 
TRIE-TREE  combination  behaves  much  like  hashing  with  chained  overflow  - 
e.g.,  long  chains  In  the  TREE  may  require  reorganization  of  either  TRIE 
or  TREE.  The  search  tree  Is  not  particularly  outstanding  In  any  respect, 
but  does  have  the  advantage  that  reorganization  can  be  continual,  thus 
providing  upper  bounds  on  retrieval  times. 

Severance^^’^^  describes  a generalized  structure  for  character  string 
searches,  which  consists  of  a one-level  prefix  search  with  a TRIE,  a one- 
level  prefix  search  with  a TREE,  and  a search  of  a sequence  of  synonyms. 
Five  parameters  completely  characterize  a structure:  1)  the  number  of 
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characters  used  in  the  TRIE  search  (If  Ot  then  of  course  there  is  no  TRIE 


search),  2)  a Boolean  variable  indicating  whether  the  result  of  the  TRIE 


search  is  a data  or  structure  record,  3)  the  number  of  characters,  if 


in  the  TREE  search,  4)  the  fraction  of  chained  nodes  in  the  TREE 


and  5)  the  fraction  of  chained  records  in  the  final  set  of  synonyms 


Refer  to  the  example  of  the  set  of  colors,  discussed  earlier  under  TRIE 


and  TREE.  Various  combinations  of  parameter  values  may  be  chosen  to 


model  many  common  access  paths  (search  trees  and  hashing  with  open  over- 


flow do  not,  however,  seem  to  be  representable  in  Severance's  model) 


For  example,  if  the  number  of  characters  in  the  TREE  search  is  equal  to 


the  total  number  of  characters  in  the  string,  then  the  result  is  either 


direct  addressing  or  addressing  through  an  index,  depending  on  whether 


the  second  parameter  indicates  mapping  into  data  or  structure.  If  the 


TRIE  maps  into  a sequence  of  records  or  pointers,  and  there  is  no  TREE, 


then  the  result  Is  an  Inverted  file.  If  there  is  no  TRIE  search,  but 


there  is  a TREE  search,  then  the  result  is  an  indexed  sequential  file 


If  there  is  neither  TRIE  nor  TREE  search,  then  the  result  is  an  array 


a list,  or  an  array  with  overflow,  according  to  whether  the  fraction 


of  chained  synonyms  is  zero,  one,  or  an  intermediate  value.  If  the  TRIE 


involves  some  but  not  all  of  the  total  number  of  characters,  there  is  no 


TREE,  and  if  the  synonyms  are  chained,  then  the  result  is  division  hashing 


with  chained  overflow 


has  developed  a computer  program  which,  given  charac- 


teristics of  hardware  (speeds  and  sizes),  data  (sizes,  volatility,  and 


58.  Severance,  D.  G.,  "A  Simulation  Model  for  Basic  File  Organiza- 
tions, ISDOS  Working  Paper  154”,  ISDOS  Project,  Department  of  Industrial 
and  Operations  Engineering,  The  University  of  Michigan,  Ann  Arbor  (March 
1972). 


usage),  and  design  objectives  (relative  costs  of  space,  maintenance  time, 
and  retrieval  time) , will  generate  and  evaluate  combinations  of  parameters 

and  produce  a near-optimal  data  base  design. 

21 

Yao  describes  a somewhat  more  general  model,  which  allows  for  an 
arbitrary  number  of  levels  of  TRIEs  and  TREEs.  Yao  has  constructed  a 
program,  based  on  this  model,  for  generating  and  evaluating  near-optimal 
data  base  designs.  Comparisons  with  other  models  suggest  that  Yao's  model, 
cost  equations,  and  optimization  techniques  are  extremely  accurate.  Like 
Severance's  model,  Yao's  model  does  not  seem  able  to  represent  search 
trees  and  hashing  with  open  overflow.  Both  are  based  on  a TRIE-TREE 
sequence,  which  Is  not  always  appropriate.  For  example.  If  parts  are 
identified  by  a string  of  two  letters  (the  actually  occurring  letters  are 
approximately  consecutive,  and  hence  dense),  followed  by  three  numbers 
(sparse),  followed  by  one  letter  to  denote  sub-parts  (consecutive),  then 
the  best  organization  Is  a TRIE-TREE* TRIE,  as  follows: 


KL 

,1 

KM 

r 

KN 

Either  one  three-letter  TRIE  or  two  consecutive  TRIEs,  followed  by  a TREE 
would  be  Inefficient  In  storage,  assuming  that  the  number  of  subparts 
depends  on  the  part.  Assuming  that  sub-parts  are  frequently  required 
consecutively,  either  would  also  require  more  time  and  peripheral  accesses. 
A three-number,  one-letter  TREE  would  be  inefficient  in  time,  since  the 
length  of  the  TREE  chains  would  be  long.  Neither  optimization  program  is 
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able  to  exploit  non-tmiform  record  activity,  nor  to  assign  costs  to 
In-core  buffers,  nor  to  deal  with  more  than  two  storage  levels  (e.g., 
neither  can  model  staging  from  archival  to  on-line  storage).  However, 
both  programs  are  great  Improvements  over  the  educated  guesswork  of 
traditional  file  design.  Unfortunately,  there  are  no  published  figures 
on  the  relative  costs  and  accuracies  of  Severance's  and  Yao's  models. 

PARAMETERS  OF  PHYSICAL  ACCESS  PATHS 

The  subsection  on  Classifying  Physical  Access  Paths  described  the 
basic  concepts  of  access  path  design:  data  vs.  structure,  contiguity 

vs.  chaining,  ordering  vs.  no  ordering,  and  direct  vs.  sequential  access. 
As  the  previous  subsection  on  examples  Illustrates,  even  fairly  simple 
combinations  of  blocks  can  lead  to  the  necessity  for  making  a large  num- 
ber of  decisions,  many  of  which  are  by  no  means  obvious.  This  section 
presents  a further  classification  of  these  decision  points. 


Block  Structure 

The  basic  block  structure  must,  of  course,  be  determined.  Organi- 
zations at  each  level,  such  as  TRIE  and  hashing,  TREE,  and  various  search 
trees,  were  discussed  above. 

Free  Space 

The  allocation,  utilization,  and  recovery  of  free  space  is  Itself  a 
problem  of  designing  physical  access  paths.  Free  space  may  be  located  In 
one  overflow  area,  as  In  ISAM,  or  It  may  be  distributed  throughout  a 

59 

number  of  levels  of  blocks,  as  In  IBM's  Virtual  Storage  Access  Method 

59.  Keehn,  D.  G.,  and  J.  0.  Lacy,  "VSAM  Data  Set  Design  Parameters", 
IBM  Systems  J.,  Vol.  13,  No.  3,  pp.  186-212  (1974). 
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(Martin,  1975,  pp.  283-290).  Instead  of  a physical  block  overflowing, 
as  It  does  In  ISAM,  It  Is  split  and  roughly  one-half  Is  moved  Into  a new 
area  of  free  space;  the  old  and  new  areas  are  then  each  about  one-half 
full.  The  Index  Is  updated,  which  may  cause  another  level  of  overflow. 
Space  occupied  by  deleted  records  Is  immediately  recovered.  Insertion 
and  deletion  are  therefore  more  time-consuming  than  In  ISAM,  but  retrieval 
Is  faster,  since  overflow  chains  do  not  build  up,  and  reorganization  need 
not  be  performed  as  frequently.  In  VSAM,  as  In  hashing  with  open  overflow, 
the  loading  factor  Is  quite  Important  - too  little  free  space  increases 
insertion  time,  while  Coo  much  is  expensive  in  space  and  time  since 
records  are  spread  over  more  area. 


Record  Design 

Record  size  and  possible  segmentation  are  clearly  of  great  impor- 
tance, particularly  since  Che  suitability  of  some  physical  access  paths 
(especially  hashing  with  open  overflow)  is  critically  dependent  on 
record  size.  The  previous  section  discussed  some  alternatives. 


Hierarchical  Storage 

Different  segments  of  a record,  or  records  with  different  access 


I 
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frequencies,  may  be  placed  on  different  levels  of  hierarchical  storage. 

60 

Lum  et  al  present  a cost  analysis  and  various  examples  of  allocation 
of  a data  base  Co  different  levels.  Their  results  could  be  combined  with 
the  results  of  Eisner  and  Severance^^  on  record  segmentation. 


60.  Lum,  V,  Y.,  M.  E.  Senko,  C.  P.  Wang,  and  H.  Ling,  "A  Cost 
Oriented  Algorithm  for  Data  Set  Allocation  In  Storage  Hierarchies", 
CACM,  Vol.  18,  No.  6,  pp.  318-322  (June  1975). 
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irriLIZAXIOM  PARAMETERS 


Some  requirements  which  must  be  met  by  a data  baae,  and  relevant 
design  decisions,  are  described  In  the  following  paragraphs: 

Rapid  Response  to  Simple  On-Line  Queries 

Accesses  to  peripheral  storage  should  be  minimized.  Adherence  to 
a user-oriented  order  Is  not  particularly  Important,  since  output  records 
can  be  quickly  sorted.  If  necessary,  during  report  generation.  Direct 
access  to  the  data,  or  a very  simple  Index  structure.  Is  necessary.  Hence 
a shallow  block  structure,  contiguity,  and  order  are  required. 

Rapid  Response  to  Complex  On-Line  Queries 

Direct  access  to  the  data  is  generally  unacceptable,  since  too  many 
large  records  would  be  examined.  A deep  block  structure  can  be  used  to 
subdivide  the  data  base  Into  relevant  areas  to  be  searched  exhaustively, 
or  multiple  indexes  (such  as  Inverted  file  or  bounded  multilist  indexes) 
can  be  used  so  that  query  resolution  can  be  performed  on  pointers  Instead 
of  records.  Indexes  should  be  ordered  and  contiguous  to  permit  direct 
access. 

Hlgh-Volume  Off-Line  Retrieval 

If  a single  user-defined  report  order  Is  required,  then  records 
should  be  contiguous  in  that  order  to  reduce  access  time  and  avoid  sub- 
sequent sorting  (refer  to  the  analysis  of  the  multilist  retrieval).  If 
other  users  require  different  report  orders,  separate  Indexes  of  pointers 
may  be  maintained,  and  sorting  can  then  be  performed  by  constructing  chan- 
nel programs. 
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On-Line  Maintenance 

Insertion  may  either  be  done  Immediately  or  be  deferred  by  means  of 
a special  block  which  accumulates  Insertions.  Distributed  free  space 
facilitates  Insertion  Into  ordered,  contiguous  blocks;  many  levels  of 
blocks,  each  with  free  space,  minimize  the  amount  of  re-order Ing  necessary 
In  such' Insertions . Deletion  should  be  done  by  logically  marking  deleted 
records,  which  are  then  physically  replaced  by  other  records  only  when 
their  space  Is  needed.  Modification  of  an  Indexed  field  on  a direct  ac- 
cess device,  or  of  any  field  on  a tape.  Is  a deletion  followed  by  an  In- 
sertion; modification  of  a non- Indexed  field  on  a direct-access  device  Is 
a simple  search  followed  by  a rewrite  In  place.  Deferral  of  maintenance 
Is  attractive,  because  response  times  are  more  uniform  and  maintenance 
generally  more  efficient.  Insertions  can  be  ordered  to  reduce  physical 
accesses  and  movement  of  data.  However,  the  Insertion  block  must  be  ac- 
cessed separately,  and  In  Itself  presents  a problem  In  access  path  design. 

High-Volume  Continuous  Maintenance 

Highly  volatile  files  which  require  a user-defined  order  should  be 
structured  In  many  block  levels  with  distributed  free  storage  to  minimize 
movement  of  data.  Records  should  be  chained  rather  than  contiguous. 

Chains  should  be  maintained  separately  rather  than  embedded  In  the  data. 

If  necessary,  embedded  chains  should  be  two-way  to  simplify  linking  and 
unlinking.  If  possible,  the  order  of  records  should  be  determined  by  the 
availability  of  free  space.  Data  base  keys  and  a translation  table  should 
be  used  rather  than  physical  addresses,  to  reduce  Index  updating  If  data 
records  are  moved. 
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Storage  coats 'can  be  reduced  by  eliminating  redundant  records,  by  al- 
locating infrequently  used  blocks  of  a data  base  to  leas  expensive  devices, 
and  by  compression  techniques,  as  in  the  preceding  section.  Pointer  size 
can  be  reduced  by  the  use  of  data  base  keys,  which  can  be  much  shorter 
than  physical  addresses,  and  a translation  table.  This  is  particularly 
significant  In  an  Inverted  file.  Note  that  compression  techniques  can  be 
used  very  effectively  on  indexes. 

Loading  and  Reorganization 

The  efficiency  of  such  structures  as  search  trees  and  hashing  tables 
can  be  very  high  If  the  distribution  of  record  usage  is  highly  skewed  and 
loading  is  in  descending  order  of  record  usage.  Frequent  reorganization 
tends  to  reduce  the  adverse  effects  of  maintenance  activity  on  relatively 
static  structures  with  simple  free-space  structures,  such  as  ISAM  or  hash- 
ing with  open  overflow.  Such  structures  are  also  relatively  easy  to  reor- 
ganize. 

SUMMARY  OF  PHYSICAL  ACCESS  PATH  DESIGN 

This  section  presented  the  basic  concepts  with  which  one  can  design 
physical  access  paths  that  provide  reasonable  compromises  among  competing 
objectives,  such  as  retrieval  speed  and  storage  cost.  Commonly  used  ex- 
amples were  presented  to  illustrate  the  application  of  the  concepts. 

Various  design  and  evaluation  models  were  cited.  Finally,  very  rough 
generalizations  were  made  concerning  the  attainment  of  selected  specific 
obj ectives. 
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SUMMARY 

The  objective  of  this  report  has  been  to  divide  the  broad  subject 
of  data  base  design  Into  four  separate  areas  - logical  record  design, 
logical  access  path  design,  physical  record  design,  and  physical  access 
path  design  - and  to  provide  a more  or  less  Independent  analysis  of 
each  area.  Logical  record  design  has  been  shovm  to  be  a consequence  of 
the  real-world  relationships  being  modelled.  Logical  access  path  design 
has  been  discussed  as  a choice  among  competing  models  - hierarchical, 
network,  and  relational  - which  all  offer  advantages  and  disadvantages. 
Physical  record  design  has  been  primarily  oriented  toward  reduction  of 
storage  costs.  Record  compression  and  segmentation  have  been  discussed. 
Physical  access  path  design  has  been  the  subject  of  the  most  lengthy  and 
least  decisive  discussion.  Objectives  such  as  retrieval  speed  and  storage 
efficiency  generally  conflict,  and  the  number  of  possible  combinations  of 
basic  elements  is  so  large  that  only  broad  generalizations  have  been 
made.  Numerous  design  and  evaluation  models  have  been  cited. 
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